The Choice of Data Types May Be More Important Than You Think

The Choice of Data Types May Be More Important Than You Think

⌘ Key Takeaways

  • Different data types can have different sizes and storage requirements, which can impact the amount of storage space required and the speed of data retrieval.

  • When working with numerical data, it is important to consider the precision required and whether floating-point or integer data types are more appropriate for the data.

  • Text data types can include variable-length fields, which can impact storage requirements and indexing capabilities.

  • Boolean data types are useful for representing true/false values, but may not be appropriate for all types of data.

  • Date/time data types can include a range of values, from seconds to centuries, and require careful consideration of time zone and daylight saving time rules.

  • Data type selection should take into account both the current needs of the data analysis as well as any potential future needs.

  • Improper data type selection can lead to data corruption or loss, which can be costly and time-consuming to recover from.

  • Data type selection can also affect the performance of queries and data analysis tools, so it is important to choose the right data type for the job.

  • Data validation and cleaning can help to ensure that the data is properly formatted and that any errors or inconsistencies are addressed before analysis.

  • In order to make informed decisions about data type selection and other aspects of data management, it is important to have a solid understanding of both the technical and domain-specific aspects of the data being worked with.

Hello to all. The length of this blog post may be due to the depth with which we discuss the significance of data type selection in databases. Because choosing the data types for a database may not seem like a big decision, it is frequently neglected. The database's performance and scalability, however, can be significantly impacted by selecting the right data types. We first look at the image below to get a mental picture of how a database engine operates before moving on to the first component, which is data.

Data types in the context of databases specify the kind of data that can be kept in a specific column of a table. As an illustration, the definition of a column that stores a person's age as an integer may differ from that of a column that stores a person's name as a string.

Why is Data Type Selection Important?

Choosing the appropriate data type for each column can significantly affect the performance and scalability of the database. Here are some reasons why:

  • Storage Space - Different data types require different amounts of storage space. For example, a 32-bit integer requires 4 bytes of storage, while a 64-bit integer requires 8 bytes of storage. Choosing the appropriate data type can save disk space and reduce storage costs.

  • Performance - Different data types have different performance characteristics. For example, integer operations are typically faster than string operations. Choosing the appropriate data type can improve query performance and reduce query execution time.

  • Data Integrity - Choosing the appropriate data type can help ensure data integrity. For example, if a column is defined as an integer, the database will not allow non-numeric values to be stored in that column. This can help prevent data corruption and ensure data consistency.

  • Scalability - Choosing the appropriate data type can also affect the scalability of the database. For example, if a column is defined as a string with a fixed length of 100 characters, the database will reserve 100 bytes of storage space for each row, even if the actual data stored in the column is much smaller. This can limit the number of rows that can be stored in the database and reduce scalability.

Data types are crucial due to storage space, performance, data integrity, and scalability, as mentioned above. We often forget to define the data types, particularly string properties, when using ORM tools like EFCore, Hibernate, Prisma, etc. The default data type for ORM is then varchar(max) in SQL Server or text in PostgreSQL.

Initial impressions are positive; there are no performance issues, and scalability and high availability are both perfect. As data continues to grow, we experience issues like memory exhaustion, timeouts, and slow response times. After giving it some thought, we increase the sources, including the driver and RAM. because many large companies engage in similar behavior. Source expansion, whether horizontal or vertical, is initially going well. When data grows rapidly, your system starts warning again and again. You are getting yourself into a vicious circle.

It is not a very rational approach to spend so many resources in today's world, where green code has become widespread and popular. Yes, if we really need more resources, we can do it. But first, we must consider how we can use our resources with the least amount of energy. What is the best resource for us? Is there another way to improve system performance? Maybe I'll reconsider?

The first requirement is that you understand how the database stores data, how the query engine executes SQL queries, and how memory works in the buffer. When you have this knowledge, you will understand why data type selection is important.

How is the data stored in the database?

I don’t spin out the storing database all over the edge. Because this essay is not a university master program essay, I visualize the storing data logically as simply as possible.

Imagine you create a table that has many columns that have various data types, your table storage will be like the below image.

Above the image, you can see a table with many rows and columns. When you select data from any table, you will see something similar to the image above. However, you do not see the database catalog; instead, you only see data and possibly data types. The database catalog contains information about table schema.

The database catalog in PostgreSQL is a system schema that contains information about the database and its objects. The database catalog is an important part of PostgreSQL architecture because it gives the server the information it needs to manage the database system and its objects. It is also used by PostgreSQL tools and utilities to display information about tables, such as the psql command-line tool and graphical database management tools.

You see all the pages above the image. Table content is stored as a collection of pages file. Each page typically stores a few KB of data. It is enough to store multiple rows but not the entire table.

You may ask this question, I understand the page logic, but how do they relate to each other? Because when you retrieve a bunch of data from a table you don’t see the pages, you only see rows and columns. So you think that the query engine retrieves the data from page a logic that I don’t know for now.

Let’s draw the pages stored in the database.

A header page in PostgreSQL is a special page at the start of each file that stores metadata about the file itself, such as its size, format, and version. The header page also includes details about the database to which the file belongs, such as the database name, encoding, and version number.

The header page is an essential part of PostgreSQL's storage architecture because it allows the server to quickly determine important information about a file and its associated database. The server uses this information to manage file I/O operations and ensure data consistency and integrity.

A directory page, on the other hand, is a type of data page that is used in PostgreSQL's storage architecture to store pointers to other data pages. Directory pages are used in multi-level indexes to help users navigate the index hierarchy and quickly find the data they are looking for.

Directory pages contain information about the data pages they point to, such as the page number, block number, and the page's minimum and maximum values. PostgreSQL can efficiently search and retrieve data from large indexes by storing this information in directory pages, eliminating the need to read every data page in the index.

When you retrieve data from now on, you imagine the logic's workflow. Although we understand the majority of it, it cannot be said that it is finished. Because you can state what the significance of data type selection is. They all go to the pages, which we retrieve page by page. You are correct that the query engine retrieves data page by page.

It first goes to the database catalog to understand what is in the table, then to the page header and directory page to understand which data types they have and how large they are. The query engine requires this information in order to allocate memory to execute the query. So it is important to understand how the page stores the data inside. * It uses a page-based storage model in which each page is divided into slots, each of which stores a record, such as a table row. This design enables fast data retrieval and manipulation.

Pages can be divided into slots in a variety of ways, including fixed-length and variable-length records. Fixed-length records have a fixed size, whereas variable-length records can vary in size. Because one column in PostgreSQL is 4 bytes long, the slots must be divided into fields.

Fixed-length or variable-length data can be stored in fields. Integer, boolean, and decimal data types are examples of fixed-length fields. The sizes of these fields are saved in the database catalog, allowing for quick and easy data retrieval.

Variable-length fields, on the other hand, store data types such as text on the page itself. This design choice was made because the size of the data can vary greatly, making it difficult to predict the size of the field ahead of time.

PostgreSQL uses a method known as heap storage to effectively store variable-length fields. By referencing the actual data with a series of pointers, this technique enables flexible data storage. While the data itself is kept on the page, the pointers are kept in the slot. Variable-length data can be stored and retrieved effectively using this method.

Using database catalogs is a key component of PostgreSQL's storage model. Information about the database schema, including the names of tables and columns, data types, and sizes, is kept in these catalogs. The database system makes use of this data to efficiently retrieve and manipulate data.

PostgreSQL can efficiently store both fixed-length and variable-length data types by dividing pages into slots and fields. The use of database catalogs and heap storage increases the system's efficiency, making it a popular choice for many applications.

We understand how data is stored in a database. Then we'll consider which data type is best for database performance. If you properly instruct your table, PostgreSQL will store data efficiently.

It is time to implement good database design in order to improve performance, architect quality, and reduce RAM and disk usage. After six years as a database engineer, I can say that there are two or three common misconceptions when designing a relational database structure.

The choice of primary key data type (Serial vs UUID)

The type of column to use for auto-generated primary keys is a critical decision. UUID, Serial, and Identity columns are three popular choices. We'll go over each option with you and help you decide which one is best for you.

What are Identity columns?

Identity columns are similar to Serial columns, but they support multiple data types and have custom start and increment values. PostgreSQL 10 introduced identity columns, which are now part of the SQL standard.

When should you use Identity columns?

Identity columns are ideal when you require more functionality than Serial columns provide, such as custom start and increment values or support for multiple data types.

UUID (Universally Unique Identifier) columns generate unique identifiers using random or time-based algorithms. UUIDs are globally unique, meaning that they are unique across all systems and databases, making them ideal for distributed systems. UUIDs are typically 128 bits long, consisting of a combination of numbers and letters.

When should you use UUID columns?

UUID columns are ideal when you need to generate unique identifiers that are globally unique, and you don't want to rely on a centralized server to generate them. For example, if you're building a distributed system where each node generates its own data, UUID columns would be an excellent choice.

UUID Columns Performance Impact

Generating UUIDs can be computationally expensive, as the algorithm used to generate them is more complex than generating Serials. This can lead to slower performance when inserting or updating large numbers of rows.

UUIDs are typically 128 bits long, which can result in larger index sizes and slower query performance.

However, because UUIDs are globally unique, they can simplify database design and reduce the need for centralized servers to generate IDs.

Serial columns are auto-incrementing columns that generate unique identifiers sequentially. They are typically 32-bit integers and start at 1, incrementing by 1 for each new row. Serial columns are easy to use and manage, making them a popular choice for many PostgreSQL applications.

When should you use Serial columns?

Serial columns are ideal when you need to generate unique identifiers sequentially and don't require global uniqueness. For example, if you're building an application where each user generates their own data, a Serial column would be a good choice.

Serial Columns Performance Impact:

Generating Serials is less computationally expensive than generating UUIDs, as it involves simple arithmetic operations.

Serials are typically smaller in size (32 bits) than UUIDs, which can result in smaller index sizes and faster query performance.

However, because Serials generate IDs sequentially, they can cause performance issues when multiple concurrent transactions are inserting or updating rows. This is because PostgreSQL uses a table-level lock when generating Serial IDs, which can lead to contention and reduced concurrency.

In general, if you need globally unique identifiers, then UUIDs are a good choice. However, if you need sequential identifiers and high concurrency, then Serials may be a better choice.

It's important to test the performance of your database under different loads and workloads to determine the optimal choice for your specific needs.

What are Text and Varchar types?

To begin with, the text is not an SQL standard, but it is supported by several SQL database management systems.

Text and Varchar are character data types used to store alphanumeric data in a database. Text is a variable-length data type that can store up to 1GB of data, whereas Varchar is also a variable-length data type but has a maximum length limit. The maximum length for Varchar varies between different database management systems, such as PostgreSQL and SQL Server.

How does Postgres store the text and varchar data type?

The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.

In PostgreSQL, both the text and varchar data types are stored as variable-length character strings in the database.

When a text or varchar value is inserted into a PostgreSQL database, the database engine first checks the length of the string. If the length of the string is shorter than the maximum length specified for the varchar column, the string is stored as-is, with any unused space padded with null bytes. If the length of the string is longer than the maximum length specified for the varchar column, the string is truncated to fit the column length.

PostgreSQL uses a storage method called TOAST (The Oversized-Attribute Storage Technique) to store large values that exceed a certain size limit. When a text value is too large to fit in a regular table page, it is automatically stored in a separate TOAST table. The TOAST table is a separate table that is created by PostgreSQL and is used to store large values in a compressed format. When the value is retrieved, it is automatically decompressed and returned as a text value.

Varchar values that exceed the maximum length specified for the column are also stored in the TOAST table. However, varchar values that fit within the maximum length specified for the column are stored inline in the regular table page, just like text values.

It's worth noting that TOAST storage has some performance overhead, as there is a small amount of extra processing required to compress and decompress the data. However, in most cases, the performance impact is negligible and the benefits of being able to store large values outweigh the drawbacks.

What is the impact of choosing Text vs Varchar types on performance?

Choosing the appropriate data type can have a significant impact on the performance of the database. Here are some performance considerations for choosing between Text and Varchar types:

Storage space: Text types can take up more storage space than Varchar types, as they have no maximum length limit. This can impact the performance of the database by increasing disk I/O operations and slowing down backups and restores.

Indexing: Text types cannot be indexed in SQL Server, whereas Varchar types can be. This can impact the performance of the database when running queries that require text data to be indexed.

Query performance: The data type used can have an impact on query performance. In SQL Server, for example, a query that uses a Text data type may be slower than a query that uses a Varchar data type because you cannot add an index to it.

The text type may not have a direct impact on performance in PostgreSQL. According to their official document:

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact, character(n) is usually the slowest of the three because of its additional storage costs. In most situations, text or character varying should be used instead.

Eventually, selecting the appropriate data type for columns in a database table is a critical decision with significant implications for performance, storage, and functionality. While some data types may appear interchangeable at first glance, each has distinct characteristics that make it better suited for specific use cases.

For example, text and varchar data types are both used to store character data, but they differ in their maximum size limits and memory allocation. Text is better suited for storing large amounts of text data, while varchar is more efficient for storing smaller text values.

Similarly, choosing the appropriate numeric data type, such as integer or decimal, can have a significant impact on storage and performance. Integer data types are more efficient for storing whole numbers, while decimal data types are better suited for storing numbers with decimal places.

Furthermore, selecting the correct data type can have an impact on the database's functionality and accuracy. For example, storing dates in a date data type rather than a text data type allows the database to perform date-specific calculations and comparisons more efficiently and accurately.

To sum it all up, data type choice is a critical aspect of database design because it affects performance, storage, and functionality. As a result, it is critical to carefully consider the specific characteristics of each data type and select the appropriate data type based on the application's requirements and the data being stored. Developers and database administrators can ensure that their database performs optimally and meets the needs of their users by selecting the optimal data type for each column in a database table.