Managing Artificial Intelligence Projects: Framework for Success. Part 4
Data collection
Data sources identified during problem formulation often grow organically, lack a unified structure, and exist in silos. For instance, patient or customer data is typically spread across dimensions such as demographics, behaviour, psychometrics, transactions, consultations, and feedback. These dimensions are captured in different systems, at different times, by different individuals.
Using siloed data for AI models is generally not recommended, as it can introduce data bias that affects model outcomes and generalizability to new data. A better approach is to consolidate data into a unified repository (a single source of truth, such as a data warehouse or data lake) before AI model development. This centralizes data access, ownership, stewardship, metadata, governance, and regulatory compliance.
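As a minimal sketch of such consolidation, assuming hypothetical siloed demographic, transaction, and feedback extracts that share a customer_id key (the file and column names are invented for illustration), the merge into a single table might look like this:

```python
import pandas as pd

# Hypothetical siloed extracts, each keyed by a shared customer_id column
demographics = pd.read_csv("demographics.csv")   # e.g. age, region
transactions = pd.read_csv("transactions.csv")   # e.g. amount, date
feedback = pd.read_csv("feedback.csv")           # e.g. survey scores

# Consolidate into a single table (one row per customer) before model development
unified = (
    demographics
    .merge(
        transactions.groupby("customer_id", as_index=False)["amount"].sum(),
        on="customer_id",
        how="left",
    )
    .merge(feedback, on="customer_id", how="left")
)

# Persist the consolidated table as the single source of truth for modelling
unified.to_parquet("customer_single_source_of_truth.parquet", index=False)
```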
However, in certain well-defined scenarios, siloed data may be necessary for model effectiveness or to meet privacy and confidentiality constraints. Not every AI project has the budget or timeline for unified data, making the use of siloed data an acceptable risk that must be documented. A modern solution to siloed data is using AI to create a meta-layer or representation of all data sources, both structured and unstructured, through techniques like federated learning, representation learning, and latent learning.
Data exploration
This stage focuses on the actual data, unlike the prior stage, which emphasized data structure. It usually starts by establishing industry benchmarks and algorithmic baselines against which later results can be compared. The prepared data structure is then filled with actual data using techniques like data visualization, correlation analysis, granularity alignment, relationship checking, outlier handling, and quality enforcement.
Data visualization includes pairwise plots, dataset projections, and interactive dashboards to scrutinize datasets for accuracy, relevance, quality, and volume. Highly correlated attributes, which introduce multicollinearity, are typically removed but occasionally labelled for further analysis. Aligning granularity ensures records are represented consistently; for instance, capturing both date and time for transactions enhances AI model accuracy.
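As an illustrative sketch of correlation analysis, pairwise correlations can be inspected and attributes above a chosen threshold flagged for removal or labelling; the 0.9 threshold and the flag_highly_correlated helper below are assumptions for illustration, not a universal rule.

```python
import numpy as np
import pandas as pd

def flag_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """Return attributes whose absolute pairwise correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each attribute pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Attributes returned here can be removed, or labelled for further analysis
# candidates = flag_highly_correlated(unified, threshold=0.9)
```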
Attributes and data points should be checked for various relationships (linear, cyclic, probabilistic, differential) and represented accordingly. For example, graph data should be analysed with AI algorithms designed for cyclic relationships. Outlier detection methods depend on the type of relationship; detected outliers may be removed or labelled, depending on the problem context. Data quality ensures the data is "fit for purpose", supported by various frameworks and standards.
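One simple sketch of outlier handling, under the assumption that the interquartile-range rule suits the attribute in question: the helper below flags rather than removes points outside the 1.5 × IQR fences, leaving the remove-or-label decision to the problem context.

```python
import pandas as pd

def flag_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside the k * IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Label rather than silently drop, so the decision stays with the problem context
# unified["amount_is_outlier"] = flag_iqr_outliers(unified["amount"])
```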
This stage also involves splitting the data for model development and evaluation. Model evaluation tests the AI model on unseen data to assess how well it generalizes. Data splitting methods include the holdout method (e.g., 80% training, 20% testing) and the calibration method, a three-way split (e.g., 70% training, 15% validation, 15% testing) in which the validation set is used to fine-tune model parameters. N-fold cross-validation is used for smaller datasets: the dataset is divided into n segments, and the model is trained and tested n times, each time holding out a different segment, to reduce bias in the evaluation.
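A brief sketch of these splits using scikit-learn; the 80/20 and 70/15/15 proportions, the five folds, and the placeholder X and y are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Placeholder feature matrix X and target y; in practice these come from the explored dataset
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Holdout method: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Three-way split: 70% training, 15% validation, 15% testing
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# n-fold cross-validation for smaller datasets (here n = 5)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
    pass  # train on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx]
```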
During the data preparation and exploration stages, you might discover that the existing data is insufficient for building AI models. In such cases, it's important to look for external data sources. A common short-term solution is to purchase the necessary data from data brokers and vendors. For example, real estate companies often buy neighbourhood demographic data from vendors that compile public records and social media content. Similarly, e-commerce platforms might purchase consumer behaviour data to better understand market trends.
Before acquiring external data, a formal risk assessment is needed. This includes evaluating the data provider's preprocessing methods, governance practices, and domain-specific policies.
However, relying on external data raises ethical concerns and sustainability issues. It is therefore advisable to develop a long-term strategy based on transparent and trustworthy relationships with stakeholders, clearly stating what data will be collected and how it will be used.
Data pre-processing
Data pre-processing ensures that acquired data for building AI models or applications can be accurately fed into algorithms. This minimizes compromises in accuracy, information value, and data quality. It includes imputation, formatting, and transformation.
Imputation identifies and replaces missing values using statistical or machine learning methods. Simple statistical approaches replace missing values with the mean, median, or mode; more advanced techniques include Multiple Imputation by Chained Equations (MICE).
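An illustrative sketch of both approaches using scikit-learn, whose IterativeImputer is modelled on MICE; the small matrix below is an invented example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 - activates IterativeImputer
from sklearn.impute import IterativeImputer

# Small illustrative matrix with missing values (np.nan)
X = np.array([[25.0, 50_000.0], [32.0, np.nan], [np.nan, 61_000.0], [41.0, 72_000.0]])

# Simple statistical imputation: replace missing values with the column median
X_simple = SimpleImputer(strategy="median").fit_transform(X)

# MICE-style imputation: each column is iteratively modelled from the others
X_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```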
Data formatting standardizes units of measurement, date formats, and category groupings across variables and the AI model's expected output. Transformation converts non-numerical data into numerical representations through techniques such as one-hot encoding.
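A brief sketch of these two steps with pandas; the column names, values, and date format are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw extract with a date column and a categorical attribute
df = pd.DataFrame({
    "visit_date": ["2024-01-05", "2024-02-05"],
    "amount_gbp": [120.0, 85.5],
    "region": ["North", "South"],
})

# Formatting: standardize the date column to a single datetime representation
df["visit_date"] = pd.to_datetime(df["visit_date"], format="%Y-%m-%d")

# Transformation: one-hot encode the categorical attribute into numerical indicator columns
df = pd.get_dummies(df, columns=["region"], prefix="region")
```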
Feature engineering
Not all parts of the dataset are equally valuable for decision-making. During the feature engineering stage, we identify and construct the features that are crucial for building the model. Sometimes, essential features are not directly available in the dataset and need to be created from raw data; this is the essence of feature engineering. For example, in a retail scenario, sales data might not directly include customer lifetime value, so you would calculate it by aggregating individual transaction data over a period. In another case, analysing flight data might require deriving delay durations from the scheduled and actual times of departure and arrival.
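A hedged sketch of both examples with pandas; the tables and column names are invented for illustration, and a simple sum of spend stands in for a fuller lifetime-value model.

```python
import pandas as pd

# Retail example: derive a simple customer-lifetime-value proxy by aggregating transactions
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [120.0, 80.0, 45.0],
})
lifetime_value = (
    transactions.groupby("customer_id")["amount"].sum().rename("lifetime_value").reset_index()
)

# Flight example: derive delay durations from scheduled and actual departure times
flights = pd.DataFrame({
    "scheduled_departure": pd.to_datetime(["2024-06-01 09:00", "2024-06-01 12:30"]),
    "actual_departure": pd.to_datetime(["2024-06-01 09:25", "2024-06-01 12:30"]),
})
flights["departure_delay_minutes"] = (
    (flights["actual_departure"] - flights["scheduled_departure"]).dt.total_seconds() / 60
)
```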
For certain models, such as convolutional neural networks (CNNs), feature engineering is less explicit and often merges with the subsequent stage of model training. In CNNs for image recognition, for instance, the network itself learns relevant features like edges, textures, and patterns, making the feature engineering process integrated into the training phase.
By Yesbol Gabdullin, Senior Data Scientist, MSc in Artificial Intelligence, MRes in Engineering, BEng in Electronics and Control Engineering, Microsoft Certified
The article was written for the Forecasting the Future of AI Applications in the Built Environment session of the Project Controls Expo 2024.