Speaker 1 (00:04):
This is the Data Science Conversations Podcast with Damien Deighan and Dr. Philipp Diesinger. We feature cutting-edge data science and AI research from the world’s leading academic minds and industry practitioners so you can expand your knowledge and grow your career. This podcast is sponsored by Data Science Talent, the data science recruitment experts.
Speaker 1 (00:29):
Welcome to the latest episode of the Data Science Conversations podcast. My name is Damien Deighan, and I’m here with my co-host, Dr. Philipp Diesinger. Today we’re discussing the rapidly developing field of satellite imaging with two remarkable industry data scientists with a strong academic pedigree. Joining us from Washington DC is Heidi Hurst. Heidi works at the intersection of math, geographic information science (GIS), computer science, and design. She did her applied math degree at Harvard with the university’s first-ever focus in navigation and GIS. She then received her GIS Master’s from University College London, where her dissertation research focused on the impact of image resolution on detecting cars in satellite imagery. Heidi completed a second Master’s in mathematical modeling and scientific computing at Oxford University, where her research focused on unitary recurrent neural networks. Currently, she’s a computer vision scientist at Orbital Insight, where she adapts state-of-the-art computer vision algorithms for use on satellite imagery for public and private sector clients. In addition to the technical challenges of machine learning, Heidi is passionate about the ethical application of machine learning and AI, and has previously consulted for government clients in emergency management, defense, and intelligence in Washington DC. Heidi, welcome to the show. Thanks.
Speaker 2 (02:00):
It’s a pleasure to be here.
Speaker 1 (02:01):
Joining us from New York is Jerry He. Jerry’s undergraduate degree from Williams College was in physics, math, and economics. From there, he also moved to the UK and did his Master’s at Cambridge in advanced math. His PhD, also undertaken at Cambridge, was in management science and operations research. Jerry started his early career in the finance world as a quant analyst and trader before making the move into his current position at Intellect Design. There he is a data scientist using satellite imaging data for a wide variety of interesting insurance and banking related business cases. Jerry, thank you so much for joining us. Okay. Starting with you, Heidi, what motivated you to work in the field of satellite imagery for such a long time?
Speaker 2 (02:51):
As you mentioned, my background is in math, and I was always passionate about the intersection of mathematics and geography and looking at how we can bring mathematical methods to bear on geographic questions. And so that sort of naturally led to an interest in satellite imagery. If we have access to not constructed maps, not survey maps, but real-time or near real-time or increasingly real-time information about the surface of the planet, you know, what insights can we gather? And so I realized that I could take a lot of the same mathematical tools that I had learned and apply them to satellite imagery to extract these geographic impacts. So my interest was always in applying math to these math-related questions.
Speaker 1 (03:32):
And what’s most interesting to you about the field of satellite imagery?
Speaker 2 (03:39):
You know, I think one assumption that folks make when we talk about anything geographic, or a piece of feedback that I had gotten when I started studying this, was: Heidi, we already know where things are. What are you looking for? But the surface of the earth is always changing. The location of cars is always changing. The buildings that folks are putting up or taking down, it’s always changing. And so what excites me most about this field is an opportunity to have an unprecedented view of the surface of the earth, almost as if you’re an astronaut at the International Space Station, to understand how things are changing and then to trace those changes all the way down to the level of impact, at the individual level or the societal level, and really understand how systems are connected.
Speaker 1 (04:19):
And Jerry, you perhaps took a more unusual route into the field. How did you go from being a quant in the finance world to working on satellite imagery?
Speaker 3 (04:31):
I guess since undergrad, I’ve always had curiosity in more than one field, as evidenced by the triple major. I wanted some mathematical modeling, so I did physics and math, but I also had an interest in economics. And when I got a full scholarship to go to Cambridge, I first studied the certificate in advanced studies, which is just pure math. And when I did that, I felt like I was getting more and more into higher-level math that fewer and fewer people care about. I mean, they care about it very intensely, but it’s definitely fewer people. So I took a huge pivot into the business school, which I guess is something I wanted to do because they also apply theoretical concepts and mathematical models, but in a way, they care more about how to apply them to solve everyday problems that affect society in a positive way.
Speaker 3 (05:27):
So in that sense, I did my PhD and I looked at the pharmaceutical industry, which is very interesting in the sense that they spend billions of dollars over multiple decades to develop a drug, and most of the time their efforts are a failure. So this is a huge management issue: if you do it well, that results in a reward, but if most of the time that effort results in a failure, how do you allocate your resources adequately across your different intellectual properties? And that drove a lot of my interest in risk management, which in turn drove me into finance. So when I arrived at XR Trading, I was a quant trader. I wrote algorithms that feed into the trading algorithm, which makes trading decisions in real time with no human intervention involved.
Speaker 3 (06:25):
The humans are just there to stop it if the trading goes wrong. And so that was a very educational experience for me. But ultimately, when you are doing something at very high speeds, you have to go for the simplest model that works. So you have to have a very fast GPU or CPU, and you have to have code that’s basically C++, almost like assembly code, that has to run extremely fast. So in that sense, that limits the complexity of the models that you can put in your trading algorithms. Intellectually, I wanted to pursue more. So after that, I actually took up an incubator program run by a guy I met at Cambridge. He was actually doing the same Part III certificate of advanced mathematics course I was doing. He went out and founded this company, and that gave me a few months of data science training.
Speaker 3 (07:24):
And it was through that training that I got onto Intellect SEEC, which was doing a lot of interesting things at the time. And what I got to do there, besides natural language processing, was this idea that you can look at satellite imagery and then you’ll be able to tell if a certain location has a flooding hazard or if a certain location has had a fire hazard, for example, historically. So you have these ideas that are very, very new and very, very cutting edge. And I went into that, and that was more intellectually satisfying in the sense that our models <laugh> are sometimes so complex that if you ran them on a conventional desktop, it would probably take weeks or months to finish.
Speaker 1 (08:15):
So satellite imaging saved you from the clutches of a career on Wall Street, is what you’re saying?
Speaker 3 (08:21):
<laugh> Yeah. So, Wall Street was definitely very interesting in the sense that, you know, you have to be able to multitask. You have CNBC playing in the background a lot of the time because, you know, the breaking news could really affect your decision whether or not to stop your algorithm from trading. But at the same time, you’re also doing R&D and you’re also hearing traders shouting across the room. And basically you have to support some of the other traders with the tools that you are developing. So you have to do almost three tasks at once, at all times. Whereas with the satellite imagery work I’m doing now, it’s more like the research that I did in my PhD. I’m focusing on one problem at a time, and I get to take my time with that, taking it one step at a time in a very methodical and thorough way.
Speaker 4 (09:14):
Heidi, satellite image data differs from image data that we use in our daily lives. For instance, data from pictures taken with mobile devices. Can you explain what those differences are?
Speaker 2 (09:25):
So often when folks think about computer vision or processing imagery, you know, you see examples of models that can detect the difference between a cat and a dog. What we’re doing with satellite imagery is in some ways similar, but in some ways very different, because the imagery itself is very different. So the type of imagery that comes from your phone has three channels, RGB: red, green, blue. Some satellite imagery has those channels as well, but it may have additional channels. So I’d say there are sort of three differences between regular imagery and satellite imagery. One is the resolution. The second is the bands, so the wavelengths that are captured by the sensors. And the third is the spatial information or metadata that comes with this. So I can kinda go through those one by one. The first was resolution. So in satellite imagery, the resolution is limited by the sensor that you’re working with.
Speaker 2 (10:11):
So common resolutions: if you have something that’s very high resolution, each pixel might correspond to 20 centimeters on the ground. If you have something that’s very low resolution, like the Landsat satellites, it might correspond to 15 meters. But the resolution has a physical meaning in the real world. The second is the spectral bands. So as I mentioned, with traditional imagery, if you just take a picture with your phone, it’ll have three bands: red, green, and blue. With satellite imagery, some satellites have additional bands, so they’ll have near-infrared bands or panchromatic bands that provide additional information that can be used to detect things that humans can’t see. Which, again, from a data processing perspective is a far more interesting question. I’m sure we’ll speak more to this later, but we don’t just wanna train the algorithms to see what humans can, we wanna train them to see beyond what humans can. And having access to spectral bands outside of what we can see is helpful for that. And then the last point about the difference was just the spatial information and the metadata. So when you take an image from a satellite, it contains information about where on earth that is, and the altitude, the angle, the time of day. All of this provides additional metadata that can be used to build advanced models about what’s happening within that image.
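Illustration (added for readers, not part of the conversation): the three differences Heidi lists, resolution, spectral bands, and spatial metadata, can be pictured as fields on a single record. The `SatelliteTile` class and all of its field names are hypothetical, chosen only for this sketch; they do not correspond to any provider’s format. The example reuses the 20-centimeter and 15-meter pixel sizes mentioned above and an assumed car length of 4.5 meters.

```python
from dataclasses import dataclass


@dataclass
class SatelliteTile:
    """Hypothetical record for one image tile (illustrative field names)."""
    width_px: int
    height_px: int
    gsd_m: float              # ground sample distance: meters on the ground per pixel
    bands: tuple              # e.g. ("red", "green", "blue", "nir")
    center_lat: float         # spatial metadata: where on Earth the tile sits
    center_lon: float
    sun_elevation_deg: float  # sun angle at acquisition time
    acquired_utc: str         # acquisition timestamp as ISO 8601 text

    def footprint_m(self):
        """Width and height of the ground area covered, in meters."""
        return (self.width_px * self.gsd_m, self.height_px * self.gsd_m)

    def pixels_across(self, object_length_m):
        """How many pixels an object of the given length spans."""
        return object_length_m / self.gsd_m


# Two made-up tiles that differ only in resolution.
high_res = SatelliteTile(1000, 1000, 0.2, ("red", "green", "blue", "nir"),
                         38.9, -77.0, 45.0, "2021-01-01T15:00:00Z")
low_res = SatelliteTile(1000, 1000, 15.0, ("red", "green", "blue", "nir"),
                        38.9, -77.0, 45.0, "2021-01-01T15:00:00Z")

car_length_m = 4.5  # assumed typical car length
print(high_res.pixels_across(car_length_m))  # a car spans roughly 22 pixels
print(low_res.pixels_across(car_length_m))   # a car is smaller than one pixel
```

At 20 centimeters per pixel a car covers a couple of dozen pixels, while at 15 meters per pixel it is a fraction of one pixel, which is one way to see why resolution has a physical meaning for detection tasks.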
Speaker 4 (11:27):
So if I understand correctly, passive sensors use sunlight as the primary light source, while active sensors don’t rely on sunlight, but emit their own electromagnetic waves, like radar, for instance. Using the sun as the primary light source is convenient, but also comes with some limitations. Clouds, haze, and shadows, for instance, can obstruct parts of the image. How do you deal with these problems?
Speaker 3 (11:50):
Exactly, yes. So I can actually give you a real example, which is that a lot of times the sun will be shining on one side of the hotel, and that will cast a shadow over the swimming pool, which makes it difficult to detect. In that case, you can complement your model with infrared data. A large body of water tends to have a different temperature than its surroundings. So the infrared data will then show you that there is a roughly circular object over there that looks like a swimming pool, that’s of a different temperature than the surroundings, and that has a very consistent temperature characteristic to it.
Speaker 2 (12:26):
And I can also speak to this question of passive versus active sensors a little bit. Most of the work that we do is with EO data, electro-optical data. We also do some work with SAR, which is synthetic aperture radar. So synthetic aperture radar is an active sensor that can see through clouds, for example. And in addition to seeing through clouds, which are obviously a difficult occlusion that optical sensors can’t penetrate, it’s also useful for imaging at night. So imagine if you were trying to track the location of ships. If you have an adversarial fleet, for example, whose movement you wanna monitor, using an active sensor like SAR at night would allow you to maintain information and situational awareness that traditional overhead imagery from a passive sensor wouldn’t allow you to maintain.
Speaker 4 (13:08):
Heidi, can you explain a bit more how the data is acquired and then also, which types of data you are actually using for your analysis?
Speaker 2 (13:17):
So there’s a variety of different satellites out there that have different resolutions, and in addition, there are a number of other platforms besides just satellites. So speaking on the satellite side of things, there are commercially available sources. So Planet Labs, and DigitalGlobe, which I think has been bought by Maxar. Those are commercially available sources. There are also government sources that are publicly available. So for example, Landsat data is very coarse resolution data that’s good for land cover and vegetation indices, that sort of work. And then of course, there are also government sources that are not publicly available. But in addition to the satellite imagery sources, there are other sources from lower altitude platforms. So in particular, one area of interest right now in terms of research and development is something called HAPS, which is high altitude platform systems. These are systems that operate at the altitude of a weather balloon, so lower than a satellite, higher than a plane.
Speaker 2 (14:12):
These are systems that can persist in the atmosphere for a significant amount of time, on the order of hours, days, weeks, but not on the order of years like we see with satellites. And the advantage of those is they can be beneath some of the clouds, and you can receive different types of imagery. So you can imagine if you have a similar sensor on a weather balloon as on a satellite, you’re gonna get higher resolution data and you’re also gonna be able to avoid some of the atmospheric influence from clouds and other things. So there’s a variety of sensors available in this space. And that’s not to mention, you know, the traditional imagery sources from aircraft or imagery sources from drones. There are a lot of options out there.
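Illustration (added for readers, not part of the conversation): Heidi mentions that Landsat data is good for vegetation indices. One widely used index is NDVI, the normalized difference vegetation index, computed per pixel as (NIR - Red) / (NIR + Red). The sketch below is a minimal pure-Python version on small nested lists; the reflectance values are made up, and a real workflow would read calibrated bands from a raster file with a geospatial library.

```python
def ndvi(nir, red):
    """Per-pixel NDVI for two equally sized 2-D lists of reflectance values.

    NDVI = (NIR - Red) / (NIR + Red), which falls in the range [-1, 1].
    Pixels where both bands are zero are reported as 0.0 to avoid dividing by zero.
    """
    result = []
    for nir_row, red_row in zip(nir, red):
        row = []
        for n, r in zip(nir_row, red_row):
            denominator = n + r
            row.append((n - r) / denominator if denominator != 0 else 0.0)
        result.append(row)
    return result


# Made-up reflectances: the left pixel behaves like healthy vegetation
# (bright in near-infrared, dark in red); the right pixel like bare ground.
nir_band = [[0.5, 0.3]]
red_band = [[0.1, 0.3]]
print(ndvi(nir_band, red_band))  # high value on the left, 0.0 on the right
```

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so higher NDVI values generally indicate denser or healthier vegetation.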
Speaker 4 (14:52):
Jerry, we heard a lot about the different data sources now. Could you explain what are typical challenges that you face when working with this kind of data and how you overcome those?
Speaker 3 (15:05):
So data reliability is a big issue, and I will start with that. As Heidi mentioned, a lot of the satellite data is outdated, and you can see that directly from Google Maps as well. So in some cases the object you’re trying to see hasn’t been constructed yet. All you see is a construction site. That is why it’s also very important that you have multiple sources of satellite data. So when one data source fails, you then go to the next one, which possibly could cost more, but then it’s much more recent. And in terms of processing the data, for the applications of insurance it is very important to distinguish, I guess, the boundaries of the property. And in this case, you can either crop out the property you want and then black out everything else, or you could also, in the absence of a good boundary, do an analysis on the entire image and then try to figure out the boundary yourself. And that requires quite a lot of algorithms, both trained and heuristic, that add up to a whole that performs this task well.
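Illustration (added for readers, not part of the conversation): the first option Jerry describes, keeping the property and blacking out everything else, can be sketched as a polygon mask. This is a minimal pure-Python version using the standard ray-casting point-in-polygon test; it assumes the boundary is already known and expressed in pixel coordinates, and the function names are made up for this example. It is not Intellect Design’s implementation.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon?

    polygon is a list of (x, y) vertices in order; the last vertex
    connects back to the first.
    """
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Does this edge cross the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def mask_outside(image, polygon, fill=0):
    """Return a copy of a 2-D image with pixels outside the polygon set to fill.

    Each pixel is tested at its center, (column + 0.5, row + 0.5).
    """
    return [
        [value if point_in_polygon(col + 0.5, row + 0.5, polygon) else fill
         for col, value in enumerate(row_values)]
        for row, row_values in enumerate(image)
    ]


image = [[9] * 4 for _ in range(4)]            # made-up 4x4 single-band image
boundary = [(1, 1), (3, 1), (3, 3), (1, 3)]    # made-up property boundary
for row in mask_outside(image, boundary):
    print(row)
```

Only the four central pixels survive; everything outside the assumed boundary is set to zero, so a downstream model sees only the property of interest.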
Speaker 4 (16:29):
Heidi, I think you mentioned other challenges with the data, like the role of weather conditions where clouds can be obstructing parts of the images or higher boundary problems that need to be solved. How do you overcome those data limitations?
Speaker 2 (16:42):
There’s certainly no scarcity of challenges in this domain. I will say, you know, one issue that you mentioned and that I’ve mentioned previously is weather. So in particular, my work focuses on object detection in the national security domain. You could imagine a lot of objects of interest aren’t found in perfectly sunny places with nice weather. In particular, you know, trying to find objects in snowy conditions, in winter conditions, in low light conditions presents very serious challenges, both from an object detection standpoint, but also, as Jerry mentioned, from an imagery sourcing standpoint. If you have outdated imagery, it’s gonna be very difficult to find things. So that’s one challenge. Another challenge that we face, and I think this is a challenge that’s quite common to a lot of folks working across data science as a discipline, is data labeling.
Speaker 2 (17:30):
So if we’re training an algorithm, or we’re building a detection algorithm, we need a training data set that contains appropriately labeled instances of whatever it is we’re trying to detect. Now, in some cases, for example, we have commercial applications that count the number of cars in a parking lot. It’s not difficult to obtain and label a significant corpus of information to allow these algorithms to be successful. But in other cases, if we’re looking at detecting particularly rare objects, particularly small objects, or particularly hard to find objects, so think, you know, rare classes of aircraft in the winter, it’s very difficult to obtain the base data that’s needed to train up these models. And actually, I’m happy to talk about that a little bit later when we talk about advances in the field, about how synthetic data can try and help fill some of those gaps.
Speaker 4 (18:22):
Jerry, you mentioned earlier that sometimes you have to deal with shadows that are being cast on the property that you want to analyze. This can also happen, if I understand correctly, when the light source, let’s say the sun, is not exactly straight above the area of interest and is rather lighting it at an angle. What do you do in such cases?
Speaker 3 (18:40):
Yeah, so shadows are a big problem because shadows cast certain boundaries, which then get picked up by our boundary detection algorithms. And then it would seem like, you know, it is not a swimming pool because of a shadow right across the middle of it. And I think in terms of the satellite not being directly on top of it, we see that as well. So basically you see a much more blurred image, rather than a sharp, crisp image. And in some cases, there will be certain artifacts in the image as well, because I believe what happens is our data provider is cobbling together something from multiple satellites. And so they are already doing some sort of pre-processing, which is not really made known to us. So in such cases, and again, as Heidi had mentioned, there are rare cases where you would have artifacts in your image that would not be picked up by machine learning algorithms just because they are rare.
Speaker 3 (19:38):
So for example, if you have umbrellas on the side of your swimming pool, and if that’s less than, say, 0.5% of your dataset, any kind of machine learning algorithm will say, well, we don’t care about those because there are so few of them. So in these cases, you would need to then insert these artifacts into, for example, existing images. You need to figure out maybe heuristic or algorithmic ways to insert those artifacts into existing images to create a larger dataset where such artifacts would get recognized by your machine learning algorithm. So this is where synthetic data generation becomes very important.
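Illustration (added for readers, not part of the conversation): the heuristic Jerry describes, inserting a rare artifact into existing images to enlarge the dataset, can be sketched as a simple copy-and-paste augmentation. This is a deliberately crude pure-Python version under stated assumptions: images are 2-D lists of pixel values, the "umbrella" patch is made up, and `None` marks transparent patch pixels. Real synthetic data generation would also handle blending, lighting, scale, and updating the labels.

```python
import random


def paste_patch(image, patch, top, left):
    """Return a copy of image with patch pasted at (top, left).

    Patch pixels equal to None are treated as transparent and skipped.
    """
    result = [row[:] for row in image]
    for dy, patch_row in enumerate(patch):
        for dx, value in enumerate(patch_row):
            if value is not None:
                result[top + dy][left + dx] = value
    return result


def augment_with_patch(images, patch, copies_per_image, seed=0):
    """Create synthetic images by pasting patch at random positions."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    patch_h, patch_w = len(patch), len(patch[0])
    augmented = []
    for image in images:
        height, width = len(image), len(image[0])
        for _ in range(copies_per_image):
            top = rng.randint(0, height - patch_h)   # keep the patch fully inside
            left = rng.randint(0, width - patch_w)
            augmented.append(paste_patch(image, patch, top, left))
    return augmented


background = [[0] * 5 for _ in range(5)]   # made-up 5x5 image
umbrella = [[7, None],
            [7, 7]]                        # made-up rare artifact
synthetic = augment_with_patch([background], umbrella, copies_per_image=3)
print(len(synthetic))  # three new images, each containing one pasted artifact
```

Each synthetic image keeps the original background and adds one instance of the rare artifact, which raises its share of the training set without collecting new imagery.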
Speaker 4 (20:16):
Heidi, it seems the field of satellite imaging is developing rapidly at the moment. Cost has been going down significantly already. And new advances in image processing, like the use of neural networks, for instance, make data analysis more and more powerful. What are your expectations regarding the developments in your field and what are you most excited about?
Speaker 2 (20:36):
I think you’re right to point out that this is an exciting field. I think it’s a really exciting time. I mean, think back to the very genesis of satellite imagery. This, I think, is wild: when the US government first started with satellite imagery, they were taking film cameras, it was all classified, and the film was designed to disintegrate on impact with water. And so the satellites would drop film canisters, and then military aircraft would be responsible for retrieving those before they hit the water and disintegrated. So we’ve gone from that, and those were incredibly low-resolution images that had to be pored over by classified analysts, to CubeSats that are the size of a loaf of bread and can be rapidly iterated on. And so the transformation that we’ve seen just on the technology side, just on the sensor side, has been dramatic.
Speaker 2 (21:19):
And, you know, there are a number of satellite imagery companies out there with really bold ambitions of imaging every part of the planet every day, or every part of the planet with an incredible level of repeat, that would give, you know, Jerry the updates he needs to provide timely information. So I think that’s one piece that’s very exciting. That’s more on the imagery provider side of the house, on the hardware side of the house. On the software side of the house, you mentioned the advancements in neural networks. It’s a very fast-moving field in terms of new network designs and new network architectures that are coming out. One advance in the field that I’m very excited about is synthetic imagery and synthetic data. So, as I mentioned before, if we’re trying to detect a particularly rare class of something, say something that we only have 10 pictures of in existence, but we want to be able to build a detector to find instances of that, it’s gonna be very challenging.
Speaker 2 (22:12):
So synthetic data aims to fill that gap by generating images that seem plausible that you can train the network on. Now, of course, there are a lot of challenges that come with that, right? Computers are very good at finding out how other computers made something, and that ultimately doesn’t always map exactly onto real-world data. But I think synthetic imagery, you know, both for EO but also for SAR, is gonna prove a very interesting field, particularly for rare classes and difficult weather conditions, all of these areas that we mentioned before, when we were discussing challenges, that are hard to get high-quality labeled training data for.
Speaker 4 (22:48):
If I understand correctly, a couple of years ago creating satellite image data that covered the entire planet would’ve taken two years, but there’s an expectation now that this data can soon be taken in about 20 minutes, or basically in real time. Do you think this is realistic? Do you think that will happen?
Speaker 2 (23:06):
I mean, I think that’s something that folks have been excited about for a long time. We’ll see how long it takes. I mean, we’ve certainly moved a lot in the last five years, but I do remember five years ago folks saying it’s right around the corner, it’s right around the corner. To be honest, some of these satellite constellations have been impacted by Covid, in terms of some of their launches and some of their ability to get hardware ready. So we’re seeing some of the launches for both satellites and platforms being impacted by Covid, which folks might not have expected. So I think it may happen, and I certainly welcome it as a data source. I wouldn’t hang my hat on it happening anytime in the next two years.
Speaker 4 (23:42):
And Jerry, what are developments that you are excited about in your field?
Speaker 3 (23:46):
In big generality, I think the cost of data coming down, as Heidi mentioned, with all of those new launches and new data sources, and also the increased computational power of our mobile devices, for example. A couple of years ago, we probably wouldn’t have imagined our iPhones being able to basically tell us which people are in our photographs, and right now we take these kinds of things for granted. But imagine a mobile device being able to analyze satellite imagery and provide applications to pedestrians on the street. I think that’s possibly not so far away, because the computational power is here and all you have to do is connect that computational power with a certain data source at a reasonable cost. And we see that in our own applications as well: the cost of data is coming down, and we’re able to use less of the free sources and move up the quality chain in terms of getting more up-to-date data.
Speaker 3 (24:50):
And when that becomes more widely available, I imagine this whole talk about satellite image analysis becomes more known to everyday users. In terms of methodology, I’m very excited about applying unsupervised methods to satellite imagery. So far there are a few well-known papers published running those unsupervised methods on STL-10, for example, which is a well-known image dataset of animals and various objects like people, but with satellite imagery, those methodologies are also very, very relevant, and we are just beginning to see that. Basically, what those algorithms do is that instead of having to have labels for every single image, you can perform certain types of pretext tasks, which basically map your images to tensors or vectors, and then you perform clustering on those pretext features. And when you do that, you’re enabling your AI to learn about the features of all of the satellite imagery without having all of it labeled. That sounds almost magical, but it actually does <laugh> work quite well. And I’m very excited about more developments in that area.
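Illustration (added for readers, not part of the conversation): the second half of the recipe Jerry outlines, clustering the pretext features so that images group together without labels, can be sketched with plain k-means. The feature vectors below are made up, the step that would produce them (a network trained on a pretext task) is not shown, and k-means is swapped in here as the simplest clustering method; the papers he alludes to may use different clustering objectives.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means clustering.

    Initial centroids are simply the first k vectors (a naive choice made
    to keep the sketch deterministic). Returns (labels, centroids).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Made-up 2-D "pretext features" for six unlabeled image tiles.
features = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
            [10.0, 11.0], [1.0, 0.0], [11.0, 10.0]]
labels, centroids = kmeans(features, k=2)
print(labels)  # tiles with similar features share a cluster id
```

No labels were supplied, yet the tiles fall into two groups; a person could then inspect a handful of tiles per cluster and name the groups, which is far cheaper than labeling every image.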
Speaker 2 (26:25):
And I think to that point, those methods are so helpful if you don’t have enough training data, exactly as Jerry was saying. You know, if you don’t have enough instances, these methods can be used in tandem with one another to really augment your model performance.
Speaker 1 (26:40):
So Heidi, what are some of the current applications in industry for satellite imagery?
Speaker 2 (26:49):
So one of the applications that is very interesting right now is supply chain monitoring. With the coronavirus and with, you know, the changes to the workforce that we’ve seen as a result, some of them in manufacturing, it’s difficult to ascertain when markets are coming back and how they are changing. I’m not an economist, so I’ll leave that to others, but it’s really interesting to be able to monitor, say, the number of cars in parking lots at manufacturing plants, which can give you a sense of what’s coming, in particular if you understand the patterns of life of these different plants and you can anticipate when production is gonna begin ramping up, for example. So that gives you a sense of when different parts of the supply chain are going to come back online, and you can couple that with other sources of information from elsewhere. So for example, if you have geolocation data from cell phones or other sources, you can combine that information to get a real sense of what sectors of the economy are starting back up or have slowed down.
Speaker 2 (27:49):
And, you know, in thinking about Covid-19 and extending a little bit farther, again thinking about car counting applications, if you count the number of cars at busy intersections, for example in China, you can get a real estimate of: are things starting to open back up? Has lockdown started to ease? Has lockdown started to ease earlier than was originally anticipated? Those sorts of things give you insight into the economic impacts coming down the line. Another area of economic interest or commercial interest that I think is quite interesting is estimating global oil reserves. So this is something that my company, Orbital Insight, has a patent for. We developed an algorithm to estimate the volume of oil in floating roof storage tanks. So these tanks are large cylinders where the roof of the tank moves up and down depending on how much oil is present in the tank.
Speaker 2 (28:42):
If you have an understanding of the sun angle, this is where shadows can be an asset instead of a detriment. If you have an understanding of the angle the image was taken at, the time, the sort of metadata that’s available from the satellite, you can estimate the volume of oil. Now scale that to the number of known oil fields we can get imagery for, and you can start to get a really good estimate of how much oil is out there. And that allows you to anticipate some of the market changes in the price of oil and the availability of oil, which, as we’ve seen over the past year when oil futures went negative, is a really interesting, really difficult space to work in. So a lot of cool commercial applications.
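Illustration (added for readers, not part of the conversation): the geometric idea behind using shadows on a floating roof tank can be sketched with basic trigonometry. This is a simplified textbook-style model, not Orbital Insight’s patented algorithm. It assumes a straight-down view, flat ground, a known sun elevation angle, and two measured shadow lengths: the exterior shadow the tank casts on the ground, and the interior shadow the rim casts onto the sunken roof. All names and numbers are made up for the example.

```python
import math


def tank_fill_estimate(exterior_shadow_m, interior_shadow_m,
                       sun_elevation_deg, radius_m):
    """Estimate fill fraction and oil volume of a floating roof tank.

    A vertical edge of height h casts a shadow of length h / tan(elevation),
    so each height is recovered as shadow_length * tan(elevation).
    Returns (fill_fraction, volume_m3).
    """
    tan_elevation = math.tan(math.radians(sun_elevation_deg))
    tank_height = exterior_shadow_m * tan_elevation   # wall height above ground
    roof_depth = interior_shadow_m * tan_elevation    # how far the roof sits below the rim
    fill_height = max(tank_height - roof_depth, 0.0)  # height of the oil column
    fill_fraction = fill_height / tank_height if tank_height > 0 else 0.0
    volume_m3 = math.pi * radius_m ** 2 * fill_height
    return fill_fraction, volume_m3


# Made-up measurements: 20 m exterior shadow, 5 m interior shadow,
# sun 45 degrees above the horizon, tank radius 10 m.
fraction, volume = tank_fill_estimate(20.0, 5.0, 45.0, 10.0)
print(round(fraction, 2), round(volume))
```

Under these assumptions the fill fraction depends only on the ratio of the two shadow lengths, since the sun angle cancels out; the angle is still needed to turn the fraction into an absolute height and volume.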
Speaker 1 (29:17):
Jerry, what applications do you think are currently most interesting?
Speaker 3 (29:20):
Yeah, I’ll build on Heidi’s points about economic predictions. So in particular, I would like to highlight agriculture. There is an active market for trading agricultural futures, which is basically how much you want to pay for coffee, for example, six months from now. And in that sense, if you are able to get this price right, whether high or low or even negative, you are able to basically make the market more efficient as well as make a profit out of it. And so certain types of satellite imagery are able to help you predict crop yield. So you’re able to look at, more or less, the image of a farmland and be able to tell how much soybean, for example, you might be expecting next month or two months from now.
Speaker 3 (30:09):
Those prediction models, as far as I’m aware, seem to be in use in trading as well as in planning, if you are the farmer or someone who is looking to purchase farm goods. In terms of more esoteric applications, which I find very interesting, you can now have enough resolution to look at things like wildlife migrating, for example, from one place to another and see if perhaps climate change has changed those migration trajectories. And on top of that, another interesting hobbyist-type application is that you can look for archeological structures without ever leaving your office. So basically, there are certain geometric structures to certain burial mounds of Asian cultures, like that they’re often circular or sometimes rectangular. And in many cases they are visible from a satellite image. And so you could be like the data science equivalent of Indiana Jones and look for those things, and be able to pinpoint the lat/long of a potential new archeological find. And this <laugh> sounds like science fiction, but it is not. The data is out there, though you may have to purchase it. And this is sort of what is most interesting to me.
Speaker 4 (31:46):
Mm-hmm <affirmative>, that’s super cool. Yeah, sorry. Indiana Jones 5 will be a movie about a data scientist.
Speaker 2 (31:51):
<laugh> Love it. We finally got our time in the limelight <laugh>.
Speaker 1 (31:56):
And you’re gonna tell us that your side gig is writing an algorithm to find the Holy Grail.
Speaker 3 (32:03):
I can’t comment on that.
Speaker 1 (32:04):
<laugh>
Speaker 2 (32:07):
You know, I wanted to add one thing to what Jerry was saying about agriculture. There are a lot of different applications for agricultural predictions. Of course, some of them are from a planning perspective. But another interesting application of prediction is national security. When you think about root causes that can exacerbate conflict, lack of access to resources, in particular famine, is one such circumstance. The ability to monitor the state of agriculture on a global scale and anticipate months or even years in advance when you might be experiencing acute famines can help you understand the sort of geopolitical tensions that could lead to an escalation in conflict.
Speaker 1 (32:52):
Just one thing I was going to ask: is there a commercial use case for satellite imaging data in assessing supply chain risk in advance, i.e. predicting a problem that might cause a collapse or stoppage in a supply chain?
Speaker 3 (33:14):
I think you can certainly see container ships leaving and arriving at ports, and in principle you could isolate, you know, the port for semiconductor export, but that probably is not the case; I imagine different industries would all be using the same ports. So I’m not exactly sure if that is a possibility, unless the object in the supply chain is very large and visible. So something like a huge metal pipeline that’s being shipped across <laugh> the sea to fill in, you know, a large oil pipe somewhere else in the world. I think those you can certainly see and detect, but individual things that might be hidden inside boxes certainly cannot be seen, I think.
Speaker 2 (34:07):
Yeah. Well, my inclination with that is that if you can combine it with other types of information, it could prove really useful. So if you know the areas of interest for different manufacturing plants, or you know where the semiconductors tend to be staged before they’re shipped, for example, and you can detect activity in these areas of interest, that’s one way that this could be used to determine potential supply chain issues in advance. Another way is if you can combine it with geolocation data. For example, Orbital Insight recently did a really exciting piece of work with Unilever where we were tracing the palm oil supply chain using geolocation data that was voluntarily provided as well as satellite imagery. Combining those sources can definitely give you an estimate, because you understand how goods are flowing and specifically where goods are flowing; oftentimes providers don’t necessarily know the exact source of their upstream products. So if you can use geolocation data to trace back where exactly you are getting, for example, your palm oil or your raw materials from, and then you can detect any sort of concerns there, such as decreased economic output or decreased availability of resources or what have you, then you can anticipate that this might be an issue in one month or two months.
Speaker 4 (35:22):
That’s super interesting. Heidi, you mentioned your own work already a couple of times. What are the problems that you’re working on on a daily basis?
Speaker 2 (35:29):
So I work for Orbital Insight, which is a startup that does geospatial analysis in both the public and the private sector. My background is in public sector and government work, so that’s primarily what I focus on. I’ve already mentioned a couple of our public sector pieces of work, such as, you know, oil tank estimation and supply chain monitoring. My work is more on the national security side of the house. So I work on building object detection algorithms for objects that might be of interest to national security. For example, we detect aircraft and different types of aircraft. There are a number of different challenges in this domain. In particular, a lot of the objects we’re interested in detecting are either quite rare, quite small, or both, which proves quite challenging, especially in lower resolution imagery. So there are a lot of applications for this type of work, and my work in general focuses on national security. So things like what we might call the order of battle. If you see a large movement of aircraft in an area where you don’t usually see aircraft, you can anticipate that might be a military buildup for some sort of action. So those are the types of questions I can work on.
Speaker 4 (36:37):
<affirmative>. And what are the methods you’re using to answer those questions?
Speaker 2 (36:41):
So we’re using a lot of variations on convolutional neural networks. In general, we do a lot of object detection. A lot of it relies on taking some state-of-the-art algorithms, state-of-the-art backbones, and sort of tweaking them to fit our specific requirements in terms of understanding the object size, resolution, that sort of thing. The other set of methods that we use, and that we’re actively exploring right now, is around these questions of generating synthetic data.
Speaker 1 (37:11):
Jerry, you also deal with synthetic data in your everyday job. Do you want to give us an overview of your day-to-day?
Speaker 3 (37:22):
<laugh> Yes, absolutely. So I guess I can give the day-to-day of when I was working on the satellite imagery project. Certainly one of the things I have to deal with is artifacts, as I mentioned before, as well as small objects. In insurance you care about things like worker safety. So in case of lawsuits relating to, I guess, injuries and stuff like that, one of the things we really care about is whether or not the construction workers are wearing hard hats and whether or not they are using ladders. And in terms of small objects, you have to be more specialized to increase your accuracy. So for hard hats, you should focus on basically the standard-issue yellow hard hats.
Speaker 3 (38:10):
And for ladders, you want to focus on the standard industry-issue type ladders. And you don’t have a lot of imagery of those things, like umbrellas of a certain size on the side of swimming pools, or hard hats and ladders. So what you would do in this case is in fact take images of properties without those features and artificially insert those features. You have to be quite careful not to insert them where they shouldn’t be. Umbrellas should not appear in the middle of the swimming pool, so you need to put them near the swimming pool, on the side. In terms of the hard hats, the algorithm is a little more difficult, but certainly you don’t want to insert too many hard hats that are right next to each other.
Speaker 3 (38:56):
And in terms of ladders, basically we just do one or two ladders near the property line. So in those cases, we have to work on both supervised models and algorithmic ways to detect boundaries. When you’re doing those object insertions, you need to know where the boundary of the swimming pool is, where the boundary of the property is, and also where the boundary of the fence that separates your property from your neighbor’s is. And those types of problems can be solved without having to train a model. You can do it heuristically and algorithmically, using standard image processing packages.
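Editor's illustration: the placement rules described here (umbrellas beside the pool but never in it, inserted objects not crowded together) can be sketched roughly as below. This is a minimal sketch and not the speaker's actual pipeline. It assumes the pool boundary is already available as a boolean mask, and the names `dilate`, `candidate_band`, `sample_positions` and `paste` are hypothetical helpers written for this example using only NumPy.

```python
import numpy as np

def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Grow a boolean mask by `radius` pixels using a square neighbourhood."""
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros_like(mask, dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def candidate_band(pool_mask: np.ndarray, radius: int) -> np.ndarray:
    """Pixels within `radius` of the pool but not inside the pool itself."""
    return dilate(pool_mask, radius) & ~pool_mask

def sample_positions(allowed, n, min_dist, rng):
    """Greedily pick up to n allowed pixels that are at least min_dist apart."""
    ys, xs = np.nonzero(allowed)
    chosen = []
    for i in rng.permutation(len(ys)):
        p = (int(ys[i]), int(xs[i]))
        if all((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 >= min_dist ** 2 for q in chosen):
            chosen.append(p)
            if len(chosen) == n:
                break
    return chosen

def paste(image, patch, center):
    """Return a copy of image with patch pasted around center, clipped at the edges."""
    ph, pw = patch.shape[:2]
    top, left = center[0] - ph // 2, center[1] - pw // 2
    y0, x0 = max(top, 0), max(left, 0)
    y1, x1 = min(top + ph, image.shape[0]), min(left + pw, image.shape[1])
    out = image.copy()
    out[y0:y1, x0:x1] = patch[y0 - top:y1 - top, x0 - left:x1 - left]
    return out

# Toy example: a 4x4 "pool" in a 20x20 image (assumed data, for illustration only).
pool_mask = np.zeros((20, 20), dtype=bool)
pool_mask[8:12, 8:12] = True
band = candidate_band(pool_mask, radius=3)
spots = sample_positions(band, n=3, min_dist=4, rng=np.random.default_rng(0))

image = np.zeros((20, 20), dtype=np.uint8)
umbrella = np.full((3, 3), 255, dtype=np.uint8)  # stand-in for an umbrella patch
for spot in spots:
    image = paste(image, umbrella, spot)
```

The same band-and-spacing idea would apply to ladders near a property line by swapping the pool mask for a property boundary mask.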
Speaker 4 (39:38):
Heidi, can you talk us through the data sources you’re using for your work?
Speaker 2 (39:41):
Yeah, sure. So, as I mentioned, we source imagery from a number of different providers, mostly private providers, but since we also work with the government, we source government data. There are also a number of open source data sets or labeled data instances that we rely on. I mentioned some of the open source land use data sets. So Landsat, for example, provides open source satellite imagery that’s really coarse. But then there are also a couple of really specific examples. DOTA, for example, D-O-T-A, is a large-scale data set for object detection in aerial images. We love a good busted acronym like that <laugh>. So that is an example of a large-scale data set that contains labeled instances of overhead imagery. For the other types of images that we work with, we have labeling subcontractors that we work with, both domestically and internationally, to provide labeled instances of training data. So I guess to answer your question, there’s private data, there’s government data, and then there’s open source data, and we work with whatever we can get our hands on.
Speaker 3 (40:52):
My data sources tend to be quite standard, think Bing Maps, for example. Basically all I need is an aerial image of the property. In terms of getting the data itself, there is actually an open source data set on Australian residential swimming pools, so that was the data source that I started with. Of course, it’s not quite applicable to our use case for commercial properties, because residential properties have a certain, I guess, structure to them that may not be present. Another thing that we’ve done that’s quite interesting is that you can actually use travel booking sites. So think of your favorite travel booking website, Orbitz, Expedia, whatever: you can specify via a filter whether or not you want your next vacation destination to have a pool.
Speaker 3 (41:49):
So you can use this methodology, for example, to find addresses of properties with pools. And of course you can’t specify no pool, because nobody wants a no-pool filter. So you actually have to scrape all of those hotels and leave out the ones with pools and the ones with indoor pools. There’s actually a filter for indoor pools as well on my favorite travel website, which shall remain unnamed. So this is a very interesting methodology to get commercial properties with swimming pools, especially ones serving the travel and leisure industry. And I can mention a few other ways as well. It’s basically a combination of using publicly available websites that can be scraped and also using your own data source. You know, there are certain legal issues with using certain data sources.
Speaker 3 (42:44):
For example, Google Earth cannot be used commercially, so that’s something you would have to avoid, but Bing Maps can be used. Then there are also much more specialized paid data sources, which we subscribe to whenever those much cheaper data sources have outdated imagery, for example of a construction site, or imagery where the object that you want to detect is covered. The swimming pool could be covered by something, in which case you will not be able to detect it, and you don’t want to train your algorithm to try to detect those. You’d rather have a more updated and possibly premium image that has it. So that’s all I have on data sources.
Speaker 4 (43:31):
Jerry, you mentioned the processing of the data already a couple of times. Can you talk us through the role that data infrastructure plays in your work? I would assume there’s a lot of parallel processing and GPUs involved.
Speaker 3 (43:42):
Yeah, so certainly you need levels of parallelization in your stack to be able to handle large amounts of data in the training process. In terms of GPUs, we use the standard AWS-type cloud infrastructure. Those GPUs are available on demand and they can be scaled up, and essentially you will be billed per second of training. So that’s the type of economics. In terms of the scaling, a lot of it is automated, I guess industry-wide. It’s unlike the type of parallelization I would do, for example, in high-frequency trading, where you are actually programming that this thread would do this part and that thread would do that part, because you want it to run as fast as possible. A lot of the parallelization has packages associated with it. You can think of Spark, for example, which does the parallelization for you, and your code looks very much the same as if you were running it sequentially. So that’s, I guess, the magic: having open source software that does the parallelization and distribution of resources for you, even the distribution across <laugh> hardware that’s running <laugh> somewhere else in a server farm.
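Editor's illustration: the point that parallel code can look almost identical to sequential code can be shown with a small sketch. This is not Spark, which the speaker mentions; it uses Python's standard-library `concurrent.futures` as a stand-in so the example runs without a cluster. The tile data and the `mean_brightness` function are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def mean_brightness(tile):
    """Average pixel value of one image tile (a list of rows)."""
    flat = [pixel for row in tile for pixel in row]
    return sum(flat) / len(flat)

# Eight tiny 2x2 "tiles" standing in for chips of a large satellite image.
tiles = [[[i, i + 1], [i + 2, i + 3]] for i in range(8)]

# Sequential version.
sequential = list(map(mean_brightness, tiles))

# Parallel version: the call has the same shape, and the executor
# decides which worker handles which tile. Results keep input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(mean_brightness, tiles))
```

Frameworks such as Spark apply the same idea across machines rather than threads; the details of partitioning and scheduling are handled by the framework rather than by the programmer.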
Speaker 4 (45:05):
But you would still say that GPUs and data infrastructure are enabling your field to have these cool applications we talked about.
Speaker 3 (45:11):
Oh yeah, certainly.
Speaker 4 (45:13):
Heidi, how about you? Are you using similar environments, or do you have any additional requirements for data infrastructure?
Speaker 2 (45:19):
Yeah, so we face a number of novel challenges on the data infrastructure side. We work with, as I mentioned, a number of providers. So we have a whole team that works with our platform to ingest that data into a format that ends up feeling, to the computer vision scientists like me, sort of provider agnostic. We also have to work to build specialized pipelines for some of our government imagery, which as you might imagine comes with its own host of caveats. So by the time the data gets to me and gets to the computer vision scientists, it’s already been processed and chipped and tiled into smaller areas that we can work with more readily. And then when it comes time for us to use the data, we have a set of local GPUs, because as Jerry mentioned, with the charges on AWS you’ve got to be careful about the economics, otherwise you wind up just throwing a bunch of money at AWS, which sometimes is inevitable. So we use GPUs either locally or in the cloud for processing. But honestly, cutting down AWS costs is, I don’t know what your experience has been, Jerry, but for us it’s been a really huge point. It’s easy, especially when people start talking about synthetic data generation and you’re training models for that, for example, to really burn through some time training those up. So it is a very important question, this infrastructure one.
Speaker 3 (46:36):
Yeah. Yeah, AWS does ramp up costs. I mean, we had experience separately with AWS Ground Truth, which is a way for, I guess, Mechanical Turk people to provide labeling for you, either for images or even natural language processing. And that <laugh> ran up a lot more cost than what we were expecting, because what it does is provide the same questions to multiple people. So they have a specialized way of doing it. And <laugh> in a way, we got our data set from other people labeling, but <laugh> the cost was such that, I think, if we were to just hire people internally, it would’ve been <laugh> more or less the same.
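Editor's illustration: sending the same item to several annotators and then consolidating their answers is what drives up the cost described here. A simple majority vote, sketched below, is one common way to consolidate. It is an assumption for illustration only and is not a description of how AWS Ground Truth actually combines annotations; `consolidate` and `min_agreement` are hypothetical names.

```python
from collections import Counter

def consolidate(votes, min_agreement=0.5):
    """Majority vote over annotator labels for one item.

    Returns (label, agreement). The label is None when no single answer
    is chosen by more than `min_agreement` of the annotators.
    """
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label if agreement > min_agreement else None), agreement

# Three annotators looked at the same image chip.
clear_case = consolidate(["car", "car", "truck"])
# Two annotators disagree, so the item is flagged rather than labeled.
tied_case = consolidate(["car", "truck"])
```

Every extra annotator per item multiplies the labeling bill, which is the trade-off between cost and label quality raised in this exchange.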
Speaker 2 (47:15):
So yeah, for that sort of stuff, there’s a company called Labelbox that we work with, and they also work with external labeling teams. But yeah, there’s the question of how you get data in a resource-efficient manner, and then once you start getting into classified data labeling, it’s a whole hassle.
Speaker 3 (47:34):
The quality is suspect as well from the Ground Truth Mechanical Turk workers, in that some of the data is simply mislabeled <laugh>. Yeah, and I was quite disappointed by that.
Speaker 2 (47:47):
Yeah. This was a challenge that we didn’t mention before, but I think it’s a really good one worth bringing up: data quality. I know we had briefly touched on it, but there’s nothing more infuriating to me than building an algorithm, running it to see what the quality is, and saying, it’s performing so poorly, why is that? Then you go in and look at an example, and it caught a bunch of cars that weren’t in your ground truth data set. So the model isn’t performing poorly. The metrics show it’s performing poorly, but the model is actually outperforming the human annotators, and there’s no way to capture that. That’s a real source of frustration for me. So when we talked about things that we’re trying to advance on, one is creating human-in-the-loop feedback, where you can use a model to create a first pass of data labels, then go back and have the human annotator say, that’s not a car, that’s not a truck, that’s not an airplane, that is an airplane. And how the two work in tandem, I think, is really exciting, because then you wind up with fewer of these data quality issues that seem to so plague training data sets.
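Editor's illustration: the effect described here, where a detector is penalized for finding real objects that the annotators missed, can be reproduced with a toy example. This is a minimal sketch under stated assumptions: boxes are `(x0, y0, x1, y1)` tuples, matching is a simplified "any overlap above an IoU threshold" rule rather than a full one-to-one matching, and `human_verdict` stands in for a hypothetical annotator reviewing the unmatched detections.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def split_detections(detections, labels, thresh=0.5):
    """Separate detections that overlap some label from those that do not."""
    matched, unmatched = [], []
    for det in detections:
        hit = any(iou(det, lab) >= thresh for lab in labels)
        (matched if hit else unmatched).append(det)
    return matched, unmatched

def precision(detections, labels, thresh=0.5):
    """Fraction of detections that overlap a labeled object."""
    matched, _ = split_detections(detections, labels, thresh)
    return len(matched) / len(detections) if detections else 0.0

# The annotators labeled one car; the model found three candidate cars.
labels = [(0, 0, 10, 10)]
detections = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
before = precision(detections, labels)

# Human in the loop: only the unmatched detections are sent for review.
_, to_review = split_detections(detections, labels)
human_verdict = {(20, 20, 30, 30): True, (40, 40, 50, 50): False}  # assumed answers
corrected_labels = labels + [d for d in to_review if human_verdict[d]]
after = precision(detections, corrected_labels)
```

In this toy case the measured precision rises from one third to two thirds without the model changing at all, because one of the supposed false positives was a real car missing from the labels.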
Speaker 4 (48:44):
Heidi, maybe a quick follow-up question. You mentioned that you guys are working on different platforms. Is it challenging for you to get data out of AWS?
Speaker 2 (48:53):
Not really. Like we end up writing most of our finalized models to S3 buckets, and so kinda either way, they wind up in the same place. So, it hasn’t been a significant challenge working between the two systems.
Speaker 1 (49:06):
Maybe one final thing: are there any practitioners in the field that you think are doing really interesting stuff, that maybe you look up to?
Speaker 2 (49:15):
I’ve been really inspired by the work of Facebook’s AI Research. They’ve put out a new object detection framework called <inaudible> that we have used for a lot of things. But the research that has come out of that group has been really <inaudible> detection.
Speaker 1 (49:30):
Do you guys want to say anything about what your current companies are doing?
Speaker 3 (49:34):
Yeah, sure. At Intellect SEEC, we’re serving to, I guess, provide a hub of information about potential insurance clients. We have a number of data pipelines, one of them being satellite imaging, where we are able to answer questions about potential natural hazards, such as flooding, that may affect the client. We also have a contextual sentiment model that’s currently on AWS Marketplace, which we use internally to look at the employee reviews of those companies. Essentially, if everyone is complaining about working at a company, then perhaps they are not so creditworthy <laugh> in terms of taking out insurance. So we are doing quite a lot of work in those areas, and I think it’s very interesting.
Speaker 2 (50:37):
The work that we do is about geospatial analysis, pretty broadly construed. Today I’ve spoken a lot about satellite imagery, but we also do a lot of work with geolocation data, which uses cell phone data and other location sources to provide an estimate of how flows of people and goods are changing, which in the time of COVID has been very interesting. That includes heat map level data to show how, you know, foot traffic in a mall is changing over time. So there’s a large variety of interesting questions and challenges that we can answer with that. If anyone’s interested to know more, go to orbitalinsight.com and tell them Heidi sent you <laugh>.
Speaker 2 (51:15):
But one thing that I think is really important when we consider data science broadly, and satellite imagery specifically, is the role that ethics might play in some of this. You know, we’re talking about viewing objects, viewing supply chains, and perhaps even viewing individual vehicles or individuals from space, and there are some ethical questions that arise with this type of data. Ethics is a difficult, complicated field, which isn’t my specialty, but I think for data scientists of all levels, it’s important to consider: where is this data coming from? How is it being used? And is it being used in a way that is ethical and that we can ethically justify? So a small pitch for all data scientists to consider the ethical downstream and upstream impacts of their work.
Speaker 1 (52:01):
And that brings to a close this episode of the Data Science Conversations podcast. Thank you so much for listening and thanks also to my co-host Philipp Diesinger and to our two amazing guests, Heidi Hurst and Jerry He. If you enjoyed this episode, please do leave us a review on your favorite podcasting platform and we look forward to you joining us on our next episode.