IBM is a major player in the mainframe market. I had a remote chat with IBM data scientist Liu Lu about how AI Singapore and IBM recently collaborated to build a solution to help their product quality engineers improve product quality classification in their mainframe product line by making better use of their data.
Below is a transcript of the conversation [*].
Basil : Hi, Liu Lu. Thanks for being here today.
Liu Lu : Hello, thank you.
Basil : So, Liu Lu. You work as a data scientist in IBM. This sounds like a really cool job. How did you arrive at this role and what is a typical day like for you?
Liu Lu : I joined IBM as an intern in 2014 and became a regular employee in 2015, so technically this is my sixth year at IBM as a data scientist. In my current role, I work with domain experts to solve business problems by developing AI solutions for IBM’s supply chain.
Basil : I know that IBM is a pretty big organisation, so let’s just focus on the part of it involved in the 100E in which, of course, you played an important role. I understand that it involved the manufacture and delivery of products in IBM’s mainframe product line. So what was the problem that you guys wanted to solve?
Liu Lu : Well, IBM is a very old company and has been making mainframes since the 1960s. We provide our customers one of the most reliable platforms for mission-critical hybrid multi-cloud environments, offering a cloud-native experience. Our warranty on maintenance and storage systems can extend to five years and up to ten years, depending on the customer warranty terms and agreements. We need to guarantee the storage product reliability and, in this case, our client is the IBM engineering team. We’re responsible for supply quality management: we drive quality improvements, establish quality metrics to review supply performance, perform root cause analysis and provide effective corrective actions for quality issues.
Basil : So, the applications are very, very critical applications, right?
Liu Lu : Yes.
Basil : And the whole process, I suppose, must be very data rich and just waiting for a data analytics solution where applicable, right?
Liu Lu : Yes, we had two major challenges. Quality engineers have to deal with large data volume and high data velocity, as well as data cleanliness issues. We also now have different formats of data, both structured and unstructured. Information is booming and we’re getting more of it, but at the same time we might also be overwhelmed by it. How can we enhance our engineers’ capability to deal with all this information? Quality problem detection is really critical to our engineers and we want to improve the problem detection accuracy and efficiency, so that proactive actions can be taken to prevent client loss.
Basil : How about the collaboration with AI Singapore? It was a 100E and we had a team of engineers and apprentices who worked with you guys. How did that come about?
Liu Lu : So, in this case we wanted to design an AI system to augment human capabilities and the business problem was to identify product quality issues and reduce the investigation time from one week to one hour. In order to achieve this goal, we needed to know what took engineers so long to do the investigation. Our engineers’ job requires them to track product performance, so before we reached out to AI Singapore (AISG), we had already done a lot of work in IBM to automate the data processing, the data visualisation and so on. But we found that there were too many charts for our engineers to view to find out which parts were having problems. Then we came up with this idea : why not train an engine to help engineers read the charts and classify the products into different categories? This would definitely augment our engineers’ capabilities. At the same time, AISG has the 100E programme and we understood that AISG has many experts specialising in deep learning, so this really aligned with our modeling objective. We talked to AISG and, in the subsequent collaboration, we were assigned apprentices to study the problem and help to design the model. So, in this case, we transformed the business problem into two technical problems. One was a classification problem. We were trying to build a model to categorise products into three different categories : high-risk, medium-risk and low-risk. A high-risk product would be pushed to our engineers as an alert. The other was a regression problem. In order to foresee quality problems, we also designed a predictive analysis to predict a product’s future failure rate. A predicted failure rate exceeding a threshold would also alert our engineers so that proactive actions could be taken in time. So that’s basically what came about in our collaboration.
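The two alert conditions described above (a high-risk classification, or a predicted failure rate above a threshold) can be sketched in a few lines of Python. This is only an illustrative sketch: the `ProductAssessment` class, the `needs_alert` function and the 2% default threshold are hypothetical names and values, not details given in the conversation.

```python
from dataclasses import dataclass

# The three categories named in the conversation.
RISK_LEVELS = ("low-risk", "medium-risk", "high-risk")


@dataclass
class ProductAssessment:
    """Hypothetical container for the outputs of the two models."""
    product_id: str
    risk_level: str                 # output of the classification model
    predicted_failure_rate: float   # output of the regression model


def needs_alert(assessment: ProductAssessment,
                failure_rate_threshold: float = 0.02) -> bool:
    """Return True if either alert condition is met.

    The 0.02 default threshold is an assumed value for illustration only.
    """
    if assessment.risk_level not in RISK_LEVELS:
        raise ValueError(f"unknown risk level: {assessment.risk_level}")
    is_high_risk = assessment.risk_level == "high-risk"
    exceeds_threshold = assessment.predicted_failure_rate > failure_rate_threshold
    return is_high_risk or exceeds_threshold
```

Under these assumptions, a product classified as low-risk today can still raise an alert if its predicted failure rate crosses the threshold, which is the proactive case the regression model is meant to cover.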
Basil : So, now you have defined the problems and you have assembled the team. The project kicked off and you have entered execution mode. What were the challenges that you guys had to overcome along the way?
Liu Lu : Well, one of the challenges was how to manage concerns and priorities. In this project, there were multiple stakeholders and they had different focal points. We assigned a champion who was responsible for collecting requirements and prioritising the items. A second challenge, which is very common, was managing user expectations. In order to solve this, we organised design thinking workshops to get domain experts involved in the AI development process to ensure the pain points were fully understood and the solution was highly aligned with the requirements.
Basil : Yes, from all the projects that we have seen, it is not just about technical problems. It is also about managing the stakeholders. This is a very important part of the whole process.
Liu Lu : Yes.
Basil : Then, of course, technically there were also challenges, right?
Liu Lu : Yes, we actually learned a lot from the technical side. Let’s talk about the predictive analysis, for example. The Weibull distribution is one of the modeling distributions that has been used by the industry for many years. When we started to build the model, that was what first came to our mind and we spent a lot of time on it. Well, the model didn’t really work very well for our products, especially for products which didn’t really have a lot of data to train with, the so-called cold start problem. In the end, we had a meeting with AISG and we decided to try out other methodologies. Surprisingly, a time-series model worked best in the end. Through this, we learned that understanding the domain is very important for data scientists. Do not lock your mind, stay hungry to discover more and keep a fresh eye for new approaches. Before we reached out to AISG, we had already trained a classification model. We wanted to improve the model performance by a further five to ten percent, which was actually very challenging. In this collaboration we found a better way to generate image colours which contributed a lot to the improvements. That was what we learned from AISG on the technical side.
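For readers unfamiliar with the Weibull distribution mentioned above, the following is a minimal Python sketch of its standard two-parameter form, with shape k and scale λ: the cumulative failure probability is F(t) = 1 - exp(-(t/λ)^k) and the hazard (instantaneous failure rate) is h(t) = (k/λ)(t/λ)^(k-1). The function names are chosen for illustration; the conversation does not say how IBM parameterised or fitted its model.

```python
import math


def weibull_cdf(t: float, shape: float, scale: float) -> float:
    """Probability that a unit has failed by time t (two-parameter Weibull)."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate at time t > 0.

    shape < 1: failure rate falls over time (early-life failures)
    shape = 1: constant failure rate (exponential distribution)
    shape > 1: failure rate rises over time (wear-out)
    """
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)
```

Both parameters have to be estimated from observed failures, so a product with very little history gives the fit little to work with, which is consistent with the cold start difficulty described above.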
Basil : That’s interesting. Could you share a little bit more of the technical details on the data and modeling part of the journey?
Liu Lu : Yes, sure. We can start with the data. Data is always the foundation before we move on to any analysis. In this case, the first challenge was the data size. Every month, there are millions of records coming in. We used data from the past ten years, reaching billions of records. So laptop CPUs were definitely not enough. We needed to use at least four GPUs to run the model. When we look at time-series data, we always think about line charts. Well, in this case, we transformed the data into an image to display the product quality. An image is always more intuitive than numbers and it became more efficient to identify the product quality. This model, as I mentioned before, actually mimics the engineer’s view to identify product risk levels. Previously, engineers looked at tables of data to identify the problems, followed by line charts, but they got overwhelmed by the information because there were too many charts for them to view. So now, we changed to images and we trained the engine to classify all these images. Based on that, products were classified into three different risk levels. We trained the deep learning model to understand the images and it was able to classify the images like our quality engineers. That is the modeling part. As for the evaluation trade-off, there are many ways to define colours in an image. Some require more complex algorithms, so in this case we traded off training time against model accuracy. Actually, this is a very common trade-off in all model building processes. In the end, we achieved our goal. The model accuracy improved by about ten percent with the training time still remaining at fifty minutes, which was good enough for us.
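The idea of turning time-series data into an image can be illustrated with a small Python sketch. The colour scheme here (a linear green-to-red scale, one image row per series and one pixel per time step) is purely an assumption for illustration; the conversation does not describe the encoding the team actually used, only that the choice of colours affected accuracy and training time.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255


def rate_to_rgb(rate: float, max_rate: float) -> Pixel:
    """Map a failure rate to a colour from green (0) to red (max_rate).

    Values outside [0, max_rate] are clipped. This linear scale is a
    hypothetical choice, not the encoding used in the project.
    """
    if max_rate <= 0:
        raise ValueError("max_rate must be positive")
    level = min(max(rate / max_rate, 0.0), 1.0)
    red = round(255 * level)
    return (red, 255 - red, 0)


def series_to_image(series: Sequence[Sequence[float]],
                    max_rate: float) -> List[List[Pixel]]:
    """Turn several time series into a grid of pixels.

    Each inner sequence (e.g. one component's monthly failure rates)
    becomes one row of the image; each time step becomes one pixel.
    """
    return [[rate_to_rgb(rate, max_rate) for rate in row] for row in series]
```

A grid like this could then be saved as an image and passed to an image classifier. A more elaborate colour mapping might separate the risk levels better but would cost more to compute, which is the kind of trade-off mentioned above.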
Basil : So I see that it is a clever transformation of the original numeric raw data into an image form. And in developing the solution, the human remains in control but with enhanced abilities.
Liu Lu : Yes, because from my own perspective, all these AI technologies in the end can absolutely help to augment human capabilities. AI plus human, I think that works best for now.
Basil : So there is still a human in the loop in this solution, which is also an example of how humans and machines can work together to achieve better results.
Liu Lu : Yes.
Basil : Thanks, Liu Lu, so much for sharing with us IBM’s intelligent use of AI technology in the business process. I hope that we have a chance for future collaboration between AI Singapore and IBM.
Liu Lu : Thank you very much. It is my pleasure and it has been a good experience collaborating with AISG and I think we learned a lot as well.
Basil : So, thanks for being here today.
Liu Lu : Thank you.
[*] This conversation was transcribed using Speech Lab. The transcript has been edited for length and clarity.