Some people may be curious whether we choose Excel or SQL in our projects, and what the differences between them are in terms of both definition and practical application. However, the mutual similarity is that they are used for Data Organisation.
“Data Organisation” refers to the method of classifying data sets to make them more useful
The purpose of this blog is to compare Excel and SQL as well as to demonstrate real-world examples in insurance industry. I hope this blog will make readers better understand these tools and be able to implement in their projects effectively.
Spreadsheet and traditional relational database : the Crash Concept
I believe that almost most people used to spreadsheet application like Excel or Google Sheet. The main feature is to represent data in the form of spreadsheet, which composes of rows and columns, and to demonstrate these data sets as graph or histogram.
Looking into SQL, it is query, which is commonly used to interact with
Traditional Relational Database , which can be seen as many excel spreadsheets, which have related relationship, grouped together as one database so that the redundancy of the data can be reduced.
Nevertheless, before writing SQL language in order to create, read, update or delete data from the database (aka CRUD) , we have to define the relationship. we can explain the structure of the database in
Entity Relationship Diagram or ERD . This diagram is standard and widely used.
For instance, it is assumed that many monthly claim can occur and be recorded in one policy (one to many relationship). We can store the data in spreadsheet like this:
It is clearly seen that the spreadsheet has redundancy in columns of
accident_date, reported_date and closed_date, and these can be fixed ,according to the principle of Traditional Relational Database by separating them to 2 tables, and add policy_id column in order to reference to id column of Policy Table as below:
To sum up, we can write the table relationship in term of ERD as below:
- ||—–o< means one to many relationship
- PK stands for Primary Key , and this key must not have duplicate values and must not have value NULL. The purpose of these rule is to prevent the duplicate data in the database and make the primary key get index. In particular, when you tell the database to insert a new row, it has to check the new row’s primary key value against every existing primary key values to ensure that the new value is not there yet. There’s no faster way do achieve this than by figuring out the value in the indexes.
- FK stands for Foreign Key . It is for connecting the relationship between 2 tables together. For example, the row of Policy Table, which has id (PK) =1, can be combined with the row of Monthly_Claim Table, which has the same policy_id (FK) =1.
By the way, one to many relationship is not the only relationship in the world. If you want to learn more about ERD and how we can define the relationship, I highly recommend to check this link out:
SQL or EXCEL ? :the Practical Detailed Comparison
The emergence of SQL made human can write query or instruction easier. We can extract data that we only want not all from the table. To give an example, supposed we have database : Table Monthly_Claim in our system as I used as an example above, we can simply write query as following:
SELECT paid_claim FROM Monthly_Claim WHERE paid_claim BETWEEN 130 AND 150 ORDER BY paid_claim DESC LIMIT 10;
Then, the result can be seen as new spreadsheet which includes the 10 greatest amounts of claims as:
Let see the code in detail
- SELECT is a keyword that tells SQL in order to query the data from the database.
- paid_claim is columns in Monthly_Claim table.
- FROM is a keyword that tells query to look up in Monthly_Claim table.
- WHERE is a keyword which filters the data as condition we specified. In our case, it is only the rows whose paid_claim has values 130 between 150.
- ORDER BY is a keyword which orders the obtained result in ascending or descending order . In our instance, it is
in descending order, so we use keyword DESC.
- LIMIT is keyword to limit the amount of shown result.
If you’re an Excel power-use , the instruction are:
- Open Monthly_Claim table , then copy all data into Excel spreadsheet
- Use FILTER or IF function to every row of paid_claim column to handle conditional logic which is that paid_claim must be between 130 and 150
- Now, use Excel feature Sort & Filter for the purpose of order
- Finally, copy only first 10 rows , then delete unwanted rows
- Alternatively, all above steps can be implemented by VBA .
It can be summarised that although Excel gives more flexibility and simplicity , it has more complex steps users have to work on. Moreover, If there is new data, we have to repeat the same instruction again, so that error can easily occur. Last but not least, provided we need to collaborate with our teammate, we have to add comment in worksheet in order to make our teammate understand what we do. All of I mentioned is quite waste of time.
Whereas, SQL users just code in a few lines, then they get the same result. Furthermore, they can reuse their code and do not have to change anything, and the code is much easier to understand. This is because its syntax is quite similar to English language.
Performance and Speed
Excel has limitation which is that the more amount of data and the more complexity of the functions which Excel users use, the worse performance and speed they experience. Also, Excel can run the program at most 1 million row.
Compared to SQL, it can be up to billion rows and its performance is much faster. In some cases. with SQL, the query can be executed within a couple of minutes, whereas it take an hour in Excel.
You can imagine if the amount of data continues growing, the size of Excel files becomes larger as well. It may be dozens of Mb. It certainly appears slow and difficult, especially when uploading and sending these files to teammates. Apart from this, there is concern about naming convention and version control. I believe most Excel user often get the problem like I shown below:
Nonetheless, this kind of problem could be mitigated if you change to use Google Spreadsheets.
Turning into SQL, it stores only a piece of text query(small size of code or instruction ), not keeping a large size of data in file like Excel. As a result, everyone in the team can access the same database. Now, everybody can run their own queries, and they do not affect one another. This means that version control is no longer necessary.
Plus, knowing SQL makes us speak the same language the IT professionals who take care the database roles, following smoothly collaboration.
Since Excel is Microsoft Office Product, it provides a good and easy-to-use UI (User interaction). Although people who never programmed before, they can easily use it . They just try clicking and using keyboard shortcuts, and they can master Excel in a short time.
SQL is not so hard too, but the most essential part is to understand the type of data and its relationship. If you want to dig deep into SQL, my recommendation is to study as following:
- Basic SQL Command
- SQL JOIN
- SQL Aggregate Function
- SQL Subqueries
- Data cleaning with SQL
- SQL Window Function
- SQL Performance Tuning
My favourite website to study SQL is tutorialspoint.com/sql .
Adoption with Other Tools
In Excel, data visualisation is a piece of cake, because Microsoft has already especially prepared built-in feature. Excel users can quickly and effectively create various graphs such as line chart, bar chart, histogram , or even time series. We can then export these graphs to other Microsoft Office programs , namely Powerpoint and Word.
One of well-known in data-driven library is D3.js. It helps us to unlimitedly design data visualisation documents. Its results are even more beautiful and flexible compared to Excel ones. Many developers use it to design and create new kinds of chart such as Zoomable Geography Graph, Sequence Sunburst Graph, etc.
But if you interest, you can navigate to D3.js by this link: https://d3js.org/ .
Connection with Database
Again, For Excel users, the easiest way to access the database is via Microsoft Access. It does not only help us to easily create Relational Database and provide built-in SQL query features, but it also allow ability to use the result from those queries to be continuously apllied next in Excel. Still, it is not free, and it can be bought from Microsoft 365 package.
At the same time, if we use SQL, we have more options when we consider accessing traditional database. The famous tools are Access, MySQL, PostgreSQL , etc.
Extended to New feature such as Machine learning
If the Excel users want to use some not-built-in feature , they have to buy add-ons from Microsoft. Besides, its machine learning feature has some limitation, so I do not suggest Excel for this work.
Similarly, SQL is not a language for machine learning, since the top languages to develop machine learning are R and Python. Yet, SQL opens a door to machine leaning, predictive modelling, data science, and AI development by reason of its nature of compatibility with other famous open source libraries.
Excel is fit for easy tasks, not complex ones, individual work or final task, which need not to forward to other teammates.
SQL is appropriate for complex, long-term and big projects, which have a frequent growing amount of data. Also, You need to take an account of collaboration concern.
You sometimes can use both Excel and SQL at the same time. As an illustration, you can write a SQL query to read the data, then plot it to Histogram with Excel.
When considering whether Excel or SQL should be implemented, one thing you have to consider is the simplicity to access and usage by yourself and your team. By instance, despite the fact that the project we currently work is really suitable for SQL, It makes no sense to implement SQL in your project if your team never have experience with SQL and database before.
Last but not least,
DO NOT choose the tool simply because others told that it is good. It may not be the best for you. We should use the tool that can make our work done smoothly with no errors and within the deadline.