One of the core things we focus on in our Cognilytica AI & Machine Learning training and certification is that machine learning projects are not application development projects. Much of the value of a machine learning project rests in the models, training data, and configuration information that guides how the model is applied to the specific machine learning problem. The application code is mostly a means to implement the machine learning algorithms and “operationalize” the machine learning model in a production environment. That’s not to say application code is unnecessary (the computer needs some way to execute the machine learning actions), but focusing a machine learning project on the application code misses the big picture. If you want to be AI-first for your project, you need to have a data-first perspective.
Use data-centric methodologies
As we discussed in our previous article on AI methodologies, if you’re going to have a data-first perspective, you need to use a data-first methodology. There’s certainly nothing wrong with Agile as a way of iterating toward success, but Agile on its own leaves much to be desired, because it’s focused on functionality and the delivery of application logic. In our previous article we outlined a data-centric Agile approach that merges the CRISP-DM methodology with Agile to bring the best of both worlds together. While this is still a new area for most enterprises implementing AI projects, we see this sort of merged methodology proving more successful than trying to shoehorn all the aspects of an AI project into existing application-focused Agile methodologies.
Digging a bit deeper, it makes sense to look at the specific artifacts an AI project needs to produce to have the most success. After all, what we’re delivering with an AI project is not functionality but data. So, what are those different data artifacts?
- Business Understanding Artifacts
  - Business background
  - Business objectives
  - Business success criteria / KPIs
  - Cost / benefit analysis
  - Resource inventory
  - Initial project plan
  - Resource allocation
  - Tool selection criteria
- Data Understanding Artifacts
  - Data source identification
  - Data collection report
  - Data description
  - Data quality analysis
  - Data cleansing requirements
- Data Preparation Artifacts
  - Data set description
  - Data selection rationale
  - Data cleansing reports
  - Derived attributes and generated records
  - Merged data
  - Reformatted data
- Data Modeling Artifacts
  - Algorithm selection approach
  - Modeling technique
  - Modeling assumptions and hyperparameter configurations
  - Training set selection and training method
  - Test set selection and test method selection
  - Generated models
  - Model assessment & validation
  - Hyperparameter revisions
- Model Evaluation Artifacts
  - Evaluation of model performance
  - Alignment of model results with business requirements and KPIs
  - Review of process
  - Operationalization requirements
  - Next iterations of model and artifacts
- Deployment Artifacts
  - Deployment code development
  - Deployment plan
  - Monitoring and maintenance plan
  - Alignment of deployment with business objectives
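To make the modeling and evaluation artifacts above concrete, here is a minimal sketch (in Python, with entirely hypothetical field names, values, and file paths) of how they might be captured as a single structured, versionable record rather than scattered across documents:

```python
import json

# A minimal, hypothetical "model card" capturing several of the artifacts
# listed above: algorithm selection, hyperparameter configuration, data
# splits, and assessment results. The schema is illustrative only, not a
# standard; every value below is an invented example.
model_card = {
    "business_objective": "Reduce churn by flagging at-risk accounts",
    "algorithm_selection": {
        "technique": "gradient-boosted trees",
        "rationale": "tabular data with mixed feature types",
    },
    "hyperparameters": {"learning_rate": 0.1, "max_depth": 6, "n_estimators": 200},
    "data": {
        "training_set": "accounts_2022Q1-Q3, 80% stratified split",
        "test_set": "accounts_2022Q4 holdout",
        "cleansing_report": "reports/cleansing_v3.md",
    },
    "assessment": {"auc": 0.87, "kpi_alignment": "meets 0.85 AUC success criterion"},
}

# Serializing the card turns it into a reviewable, diff-able project
# artifact that can live in version control alongside the training code.
card_json = json.dumps(model_card, indent=2, sort_keys=True)
print(card_json)
```

Keeping these facts in one machine-readable record means each iteration of the model produces a comparable artifact, rather than knowledge that lives only in a data scientist’s head.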
Then we need to look at the AI-specific activities required to create those artifacts. Some of those activities are things a data science role would do, while others (maybe even most) are data engineering activities. Still others are functions of business analyst and data analyst roles. At a high level, those activities and roles include:
- Business strategy development: Business analyst, Solution architect, Line of Business (LoB), Data scientist
- Dataset preparation & pre-processing: Data analyst, Data engineer, Data scientists, Domain specialists, External contributors, Third parties
- Dataset splitting: Primarily data scientists, with some data engineer involvement
- Algorithm selection, model & ensemble development: Data scientists
- Model training: Data scientists
- Model evaluation & testing: Data scientists
- Model deployment with governance framework: Data engineers, Systems engineers, Data team, Cloud team
- Business / KPI evaluation: Business analyst, Solution architect, Line of Business (LoB), Data scientist
- Model iteration: Data analyst, Data engineer, Data scientists
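As one concrete example, the dataset splitting activity above can be sketched in a few lines of framework-free Python. This is an illustrative sketch, not any particular tool’s API; in practice a library routine with stratification would usually be preferred:

```python
import random

def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle records reproducibly, then cut train/validation/test slices.

    A deliberately simple sketch: the fixed seed makes the split itself a
    documentable, repeatable artifact of the project, not a one-off accident.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded, hence reproducible
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

Because the seed and fractions are explicit parameters, the split can be recorded alongside the other project artifacts and regenerated exactly on the next iteration.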
As you can see, while agile methodologies are applicable here, they need significant modification to be used in an AI context.
Use data-centric technologies
It stands to reason that if you have a data-centric perspective on AI, then you need to follow up your data-centric methodologies with data-centric technologies. This means that your choice of tooling to implement all the artifacts detailed above needs to be, first and foremost, data-focused. Don’t use code-centric IDEs when you should be using data notebooks. Don’t use enterprise integration middleware platforms when you should be using tools that focus on model development and maintenance. Don’t use so-called machine learning platforms that are really just overgrown big data management platforms. The tools you use should support the machine learning goals you need, which are in turn supported by the activities you need to do and the artifacts you need to create. Just because a GPU provider has a toolset doesn’t mean it’s the right one to use. Just because a big enterprise vendor or a cloud vendor has a “stack” doesn’t mean it’s the right one. Start from the deliverables and the machine learning objectives and work your way backwards.
Another big consideration is where and how machine learning models will be deployed, or in AI-speak, “operationalized”. AI models can be implemented in a remarkably wide range of places: on “edge” devices disconnected from the internet, in mobile and desktop applications, on enterprise servers and cloud-based instances, and in all manner of autonomous vehicles and craft. Each of these locations is a place where AI models and implementations can exist. This highlights even more how ludicrous the idea of a single machine learning platform is. How can one platform simultaneously provide AI capabilities in a drone, a mobile app, an enterprise system, and a cloud instance? Even if you source all this technology from a single vendor, you will get a collection of different tools under a single marketing umbrella rather than a single platform.
Build data-centric talent
All this methodology and technology can’t assemble itself. If you’re going to be successful at AI projects, you’re going to need to be successful at building an AI team. And if the data-centric perspective is the correct one for AI, then it makes sense that your team also needs to be data-centric. The talent needed to build apps or manage enterprise systems and data is not the same talent needed to build AI models, tune algorithms, work with training data sets, and operationalize ML models. The primary core of your AI team needs to be data scientists, data engineers, and the folks responsible for putting machine learning models into practice. While there’s always a need for coding, development, and project management, finding and growing your data-centric talent is key to the long-term success of your AI initiatives.
The main challenge with data talent is that it’s hard to find and grow. As Cognilytica often says, “you can’t code academy your way to data talent.” The reason is that data isn’t code. You need folks who know how to wrangle many data sources, compile them into clean data sets, and then extract information needles from data haystacks. In addition, the language of AI is math, not programming logic. So a strong data team is also strong in the right kinds of math: how to select and implement AI algorithms, properly tune hyperparameters, and properly interpret testing and validation results. Randomly guessing at changes to training data sets and hyperparameters is not a way to create AI projects that deliver value. As such, data-centric talent grounded in a fundamental understanding of machine learning math and algorithms, combined with an understanding of how to deal with big data sets, is crucial to AI project success.
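To illustrate the difference between random guessing and systematic interpretation of validation results, here is a toy sketch: a one-parameter threshold classifier whose single hyperparameter is selected by sweeping candidates on training data and then checked on a held-out validation set. The data, model, and metric are all hypothetical, and only the standard library is used:

```python
import random

rng = random.Random(0)

# Toy labeled data: a score in [0, 1], labeled 1 when it exceeds a "true"
# (unknown to the modeler) threshold of 0.6, with ~5% label noise.
data = [(x, int(x > 0.6) ^ (rng.random() < 0.05))
        for x in (rng.random() for _ in range(500))]
train, valid = data[:400], data[400:]

def accuracy(threshold, rows):
    """Fraction of rows the threshold classifier labels correctly."""
    return sum(int(x > threshold) == y for x, y in rows) / len(rows)

# Systematic sweep over the hyperparameter, selected on training data and
# then reported on held-out data -- not trial-and-error tweaking.
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
best = max(candidates, key=lambda t: accuracy(t, train))
print(round(best, 2), round(accuracy(best, valid), 2))
```

Even in this toy setting, the discipline matters: the candidate grid, the selection criterion, and the held-out score are all explicit and reviewable, whereas ad hoc tweaking leaves nothing to interpret or reproduce.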
Prepare to continue to invest for the long haul
It should be pretty obvious at this point that the set of activities for AI is indeed very much data-centric, and that the activities, artifacts, tools, and team need to follow from that data-centric perspective. The biggest challenge is that much of this ecosystem is still being developed and is not fully available to most enterprises. AI-specific methodologies are still being tested in large-scale projects. AI-specific tools and technologies are still being developed and enhanced, with changes released at a rapid pace. AI talent remains scarce, and investment in growing that skill set is only just beginning.
As a result, organizations that need to be successful with AI, even with this data-centric perspective, need to be prepared to invest for the long haul. Find your peer groups to see what methodologies are working for them and continue to iterate until you find something that works for you. Find ways to continuously update your team’s skills and methods. Realize that you’re on the bleeding edge with AI technology and prepare to reinvest in new technology on a regular basis, or invent your own if need be. Even though the history of AI spans at least seven decades, we’re still in the early stages of making AI work for large scale projects. This is like the early days of the Internet or mobile or big data. Those early pioneers had to learn the hard way, making many mistakes before realizing the “right” way to do things. But once those ways were discovered, organizations reaped big rewards. This is where we’re at with AI. As long as you have a data-centric perspective and are prepared to continue to invest for the long haul, you will be successful with your AI, machine learning, and cognitive technology efforts.