Generative AI as an Assistant in Database Design

Back | BLOG

May 31, 2024

Generative AI as an Assistant in Database Design

The advent of AI, particularly LLMs (Large Language Models), has undoubtedly revolutionized the software world, transforming the development process by providing advanced tools and capabilities that enhance efficiency, quality, and collaboration at all stages of the software development lifecycle.

Today, generative AI can generate source code from a list of requirements or descriptions given in natural language, act as programming assistants by providing suggestions and corrections as the developer writes code, and generate documentation from the source code, thereby improving code comprehension and collaboration among developers. These capabilities help improve productivity, quality, and accelerate the software development process.

In this article, I aim to review how generative AI behaves when it comes to software modeling, specifically database modeling using ERM diagrams. I will evaluate its capabilities and limitations to determine if it is helpful in the early stages of software construction. To do this, I posed a series of requirements to ChatGPT, asked it to create an entity-relationship model (ERM), and subsequently analyzed the resulting representation.

Juan Ignacio

Senior Software Engineer

Table of Contents

Reviewing Some Theoretical Concepts Case 1 Case 2 Conclusion

17 minutes read

Have a project?

Tell us more

Reviewing Some Theoretical Concepts

Some topics I intend to discuss bring to mind concepts, definitions, or strategies that I learned during my student days and that help me address the subject. Reviewing old university notes from the Database course, but still valid and to be taken into account when modeling a problem, these indicate that to ensure the quality of conceptual schemas, designs must meet a series of properties:

Completeness:

All relevant features of the problem are represented.

Correctness:

Syntactic: The model represents the concepts using appropriate specifications given by the language.

Semantic: The model represents the problem.

Minimality:

An element of reality is represented only once in the schema.

Expressiveness:

Represents reality in a natural and understandable form using only the semantics of the model.

Explicitness:

Does not use more formalisms than the E-R diagram.

A good design should maximize Completeness and Correctness and find a balance between Expressiveness and Explicitness.

In the next section, I present a series of case studies where I asked ChatGPT to model a set of requirements to verify the quality of the result by analyzing whether it meets each of the mentioned points: Completeness, Correctness, Minimality, Expressiveness, and Explicitness.

Case Studies

Case 1

Requirements:

A construction company, based on the design of its projects, creates material requirements documents. Each requirement generates more than one purchase order. It wants to model what requirements require what materials, originating what orders.

ChatGPT Solution

Entities:

Project

Material

Requirement

Purchase Order

Attributes:

Project:

ProjectID (Primary Key)

ProjectName

ProjectDescription

Material:

MaterialID (Primary Key)

MaterialName

MaterialDescription

Requirement:

RequirementID (Primary Key)

ProjectID (Foreign Key referencing Project)

MaterialID (Foreign Key referencing Material)

Quantity

RequirementDescription

Purchase Order:

OrderID (Primary Key)

RequirementID (Foreign Key referencing Requirement)

OrderDate

DeliveryDate

OrderStatus

Relationships:

One-to-Many Relationship between Project and Requirement:

A Project can have multiple Requirements.

A Requirement belongs to only one Project.

Many-to-Many Relationship between Material and Requirement:

A Material can be required by multiple Requirements.

A Requirement can require multiple Materials.

One-to-Many Relationship between Requirement and Purchase Order:

A Requirement can generate multiple Purchase Orders.

A Purchase Order is generated from only one Requirement.

Next, I asked to add integrity constraints to the previous model, and ChatGPT's response was as follows:

Entity Integrity:

Each entity must have a unique identifier (primary key) that uniquely identifies each instance of the entity.

Example: ProjectID for Project, MaterialID for Material, RequirementID for Requirement, and OrderID for Purchase Order.

Referential Integrity:

Foreign key values must match the primary key values in the referenced table or be null.

Example: ProjectID in Requirement must exist in Project, MaterialID in Requirement must exist in Material, and RequirementID in Purchase Order must exist in Requirement.

Domain Integrity:

Attributes should have appropriate data types and constraints to ensure data accuracy.

Example: Quantity in Requirement should be a positive integer, OrderDate and DeliveryDate in Purchase Order should be valid dates, etc.

Business Rules:

Define and enforce any specific business rules related to the relationships between entities.

Example: Ensure that a Purchase Order can only be generated for a Requirement that belongs to the same Project.

Cascade Update/Delete:

Define whether updates or deletions to primary key values cascade to related foreign key values.

Example: If a ProjectID is updated in Project, corresponding Requirement records should also be updated with the new ProjectID.

Constraints on Order Status:

Define constraints on the possible values for OrderStatus to ensure consistency and validity.

Example: OrderStatus can only have values like "Pending", "In Progress", "Completed", etc.

Representation Analysis

While the design models the presented reality, I will now analyze its quality.

Completeness: All relevant entities of the problem and their relationships are represented, so we can say that the design is complete.

Correctness: The entire problem is represented in the schema, although it adds attributes to the entities that are not required. Ignoring this last point, we can say it is semantically correct, but the notation used is descriptive and not represented through an ERM language. In fact, it has many limitations to show the design graphically, so we cannot say that the model is syntactically correct.

Expressiveness and Explicitness: The result is understandable if we ask for a textual description of the model since the graphical representation turns out to be very confusing as it has many shortcomings and errors, not making clear which entities are related, what their cardinality is, totality, etc. Thus, we cannot say that the model meets the properties of expressiveness and explicitness.

Minimality: Each element of the problem is uniquely represented, so we can say that the model is minimal.

All problems can have multiple solutions, and all can be correct and of quality considering the above points. Although the model presented by ChatGPT is correct, it has some details that were not considered. For example, this representation allows for "Material - Requirement" pairs without an associated "Purchase Order." This could be corrected in several ways, for example:

Using multiple relationships:

Using Aggregation and Totality:

Case 2

Requirements:

Each mechanic repairs several cars. Each mechanic uses a certain tool depending on the car he works on.

ChatGPT Solution

Entities:

Mechanic

Car

Tool

Repairs

Uses

Attributes:

Mechanic:

MechanicID (PK)

Name

Experience Level

Contact Info

Car:

CarID (PK)

Make

Model

Year

Owner Info

Tool:

ToolID (PK)

ToolName

ToolType

Usage Instructions

Repairs:

MechanicID (FK)

CarID (FK)

RepairDate

RepairDetails

Uses:

MechanicID (FK)

ToolID (FK)

CarID (FK)

UsageDate

SpecificUsageDetails

Relationships:

Many-to-Many Relationship between Car and Mechanic:

Each Mechanic can repair several Cars.

Each Car can have several Mechanics working on it.

Many-to-Many Relationship between Car and Tool:

Several Tools can be used in each Car.

Each Tool can be used on several Cars.

Next, I asked to add integrity constraints to the previous model, and ChatGPT's response was as follows:

Entity Integrity:

Each entity must have a unique identifier (primary key) that uniquely identifies each instance of the entity.

Example: MechanicID for Mechanic, CarID for Car, ToolID for Tool.

Referential Integrity:

Repairs: MechanicID (Foreign Key) must exist in Mechanic table - CarID (Foreign Key) must exist in Car table.

Uses: MechanicID (Foreign Key) must exist in Mechanic table - ToolID (Foreign Key) must exist in Tool table - CarID (Foreign Key) must exist in Car table.

Representation Analysis

While the design models the presented reality, I will now analyze its quality.

Completeness: All relevant entities of the problem and their relationships are represented, so we can say that the design is complete.

Correctness: As in Case 1, it adds attributes to the entities and relationships that are not in the requirements. The entire problem is represented in the schema, so in this case, we can also say it is semantically correct. Beyond the notation used and that the model is not represented through an ERM language, this model has integrity issues as it allows a "Mechanic" to repair a Car with Tools that are not used for that Car. For both reasons, we cannot say that the model is syntactically correct.

Expressiveness and Explicitness: As in Case 1 and for the same reasons, we can say that the model does not meet the properties of expressiveness and explicitness.

Minimality: It can be observed that in the model, the Mechanic relates to the car in two ways:
Repair: The cars they repair.
Uses: The tools they use to repair it.

In my opinion, these relationships are redundant and unnecessary, so we cannot say that the property of minimality is met. This problem can be solved using aggregation, where the "Repairs" relationship is a strong entity in the "Uses" relationship.

Conclusion

The design phase is a crucial stage in software construction as the quality and success of a project are tied to good design. Artificial intelligence (AI) has introduced tools and techniques that have significantly improved efficiency, precision, and innovation in software development. It has become a fundamental ally for developers by facilitating the automation of tasks such as debugging and code generation, and improving team collaboration by automating documentation and providing intelligent interfaces that efficiently integrate and organize information. This allows developers to focus on more creative and strategic aspects.