The classification of data (that is, the generation of breaks or classes) in the element symbology module of the GEMS application is quite limited, and an improvement would benefit all GEMS users. Currently, it gives us equal interval breaks but has zero knowledge of the data or its distribution.
Say I want to color-code by diameter: there is no way to list all the unique diameters. Color-coding flow is another area that would benefit if some of the classification methods I researched were implemented. Most flows are small or small-to-medium, but GEMS builds its breaks only from the minimum and maximum values. Because of this approach, it generates breaks that represent the data very poorly.
I did a little experiment: I took the common classification methods available in different software, applied them to two different networks, and compared the resulting color coding.
Python has very good data science modules, so choosing it was a no-brainer. To get the coordinates and the input/output attributes, I used WaterObjects.NET (WO.NET).
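To make the problem concrete, here is a minimal sketch of equal-interval breaks applied to skewed data. The flow values are made up for illustration, and the function names are hypothetical; this is not how GEMS is implemented internally, only the general equal-interval idea.

```python
from bisect import bisect_left
from collections import Counter


def equal_interval_breaks(values, n_classes=5):
    """Upper bounds of n_classes equal-width classes between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    # The last break is set to the maximum itself to avoid rounding drift.
    return [lo + width * i for i in range(1, n_classes)] + [hi]


def class_counts(values, breaks):
    """Count the values in each class; class i holds values <= breaks[i]."""
    counts = Counter(bisect_left(breaks, v) for v in values)
    return [counts.get(i, 0) for i in range(len(breaks))]


# Hypothetical skewed flows: mostly small values and one large one.
flows = [1, 2, 2, 3, 3, 4, 5, 6, 8, 100]
breaks = equal_interval_breaks(flows)
counts = class_counts(flows, breaks)
print(breaks)
print(counts)  # [9, 0, 0, 0, 1]
```

Nine of the ten values end up in the first class and three classes stay empty, so four of the five colors carry almost no information.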
Data Classification methods that I selected:
• group_distribution: Divides the given range into n equal groups. This is the method the GEMS application provides.
• fixed_distribution: The user specifies the intervals. Not used in the study, as it needs manual classification.
• unique_distribution: Unique values are generated from the given data set.
• quantile_distribution: Divides the data so that each group has equal probability, i.e. holds roughly the same number of values.
• jenks_natural_brek: The best arrangement of the data into different groups (Jenks natural breaks).
• percentile_distribution: Breaks are placed at the given percentiles of the data.
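As a rough illustration of two of the methods listed above, the sketch below computes unique and quantile breaks with the Python standard library only. The function names and the diameter values are hypothetical and are not taken from the linked code, which may well use other modules.

```python
from statistics import quantiles


def unique_breaks(values):
    """One class per distinct value, in ascending order."""
    return sorted(set(values))


def quantile_breaks(values, n_classes=5):
    """Upper bounds of n_classes groups holding roughly equal numbers of values."""
    # statistics.quantiles returns the n_classes - 1 interior cut points.
    cuts = quantiles(values, n=n_classes, method="inclusive")
    return cuts + [max(values)]


# Hypothetical pipe diameters, heavily concentrated on a few sizes.
diameters = [6, 6, 6, 6, 8, 8, 8, 12, 12, 24]

u_breaks = unique_breaks(diameters)
q_breaks = quantile_breaks(diameters)
print(u_breaks)  # [6, 8, 12, 24]
print(q_breaks)
```

For a discrete attribute like diameter, the unique breaks map directly to the sizes present in the network, while the quantile breaks follow where the values are concentrated rather than the min/max range.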
Quick note: For consistency in the comparison, all data were classified into 5 groups, except for the unique distribution. The goal of this study is not to generate a beautiful color-coded network but to find better groups (illustrated by colors) to represent the data of the network.
Conclusion: There is no clear winner, unfortunately. Depending on the data, a certain type of distribution may represent it better. Unique is suited to discrete data such as diameter, C-factor, material, etc. Jenks might fit in most cases.
Code for NetworkA, with some description of the process.
Code for NetworkB, with some description of the process.
To keep the post short, below are just the images from a lengthy document.
Network A:
Let’s look at the diameter first. The diameters in this network (NetworkA) are heavily concentrated around one value, as can be seen in the histogram.
This is how the network looks with the different data classifications. None of these “look” great because the data is really skewed: almost all diameters fall into one group.
--------------------------------------------------- NetworkB ---------------------------------------------------