Proposed Classification of Data in GEMS application

Over the years, I have had the chance to work with several applications that display data on a computer screen. Of all of them, the GEMS application symbology is the best in terms of user-friendliness and versatility. The part I would like to improve, if I can, is the classification of the data that a user wants to color-code.

The classification of data (that is, the generation of breaks or classes) in the element symbology module offered in the GEMS application is quite limited, and an improvement would benefit all GEMS users. Currently, it gives us equal-interval breaks but has no knowledge of the data or its distribution.

Say I want to color-code diameter: there is no way to list all the unique diameters. Color-coding flow is another area that would benefit if some of the classifications I researched were implemented. The majority of flows are small or small to medium, but the GEMS application bases its breaks only on the min and max values. Because of this approach, it generates breaks that represent the data very poorly.
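To make the problem concrete, here is a minimal sketch of equal-interval classification on skewed data. It is not the GEMS implementation; the function names and the flow values are made up for illustration.

```python
from collections import Counter


def equal_interval_breaks(values, n_classes):
    """Return n_classes + 1 break values evenly spaced between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    return [lo + i * width for i in range(n_classes + 1)]


def class_index(value, breaks):
    """Return the 0-based class whose [lower, upper) range contains value."""
    for i in range(len(breaks) - 2):
        if value < breaks[i + 1]:
            return i
    return len(breaks) - 2  # the last class also includes the maximum


# Hypothetical flows: mostly small values plus one large trunk main.
flows = [1, 2, 2, 3, 3, 4, 5, 6, 8, 100]

breaks = equal_interval_breaks(flows, 5)
counts = Counter(class_index(v, breaks) for v in flows)

print("breaks:", breaks)
for i in range(5):
    print(f"class {i}: {counts.get(i, 0)} pipes")
```

With this sample, nine of the ten values land in the first class, one lands in the last, and the three classes in between stay empty. The colors then say almost nothing about how the small flows differ from each other.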

I did a little experiment: color-code with the common classification methods available in different software, and apply them to at least two different networks.

The Python programming language has very good data science modules, so selecting Python was a no-brainer. To get the coordinates and the input/output attributes, I used WaterObjects.NET (WO.NET).

Data Classification methods that I selected:

• group_distribution: Dividing the given range into n equal groups. This is the method the GEMS application provides.
• fixed_distribution: The user specifies the intervals. Not used in the study as it needs manual classification.
• unique_distribution: Unique values are generated from the given data set.
• quantile_distribution: Dividing the data such that each group has equal probability, i.e. holds about the same number of values.
• jenks_natural_brek: Jenks natural breaks, which arranges the data into groups so that values within a group are as similar as possible and the groups are as different from each other as possible.
• percentile_distribution: The breaks are placed at the data values that correspond to the given percentiles.
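As a rough sketch of how three of the simpler methods above can be computed, the block below uses only the Python standard library. The function names are hypothetical and are not the ones from the linked code. The linear interpolation between ranks is an assumption (it matches NumPy's default percentile method), and reading "percentile" as "breaks at user-chosen percentiles" is my interpretation of the method.

```python
import math


def unique_values(values):
    """Unique classification: one class per distinct value, sorted."""
    return sorted(set(values))


def percentile_value(sorted_values, pct):
    """Value at percentile pct (0-100), linearly interpolated between ranks."""
    if not sorted_values:
        raise ValueError("no data to classify")
    pos = (len(sorted_values) - 1) * pct / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    low_val, high_val = sorted_values[lower], sorted_values[upper]
    return low_val + (high_val - low_val) * frac


def percentile_breaks(values, percentiles):
    """Percentile classification: breaks at the requested percentiles."""
    data = sorted(values)
    return [percentile_value(data, p) for p in percentiles]


def quantile_breaks(values, n_classes):
    """Quantile classification: breaks at evenly spaced percentiles, so each
    class holds roughly the same number of values."""
    step = 100.0 / n_classes
    return percentile_breaks(values, [i * step for i in range(n_classes + 1)])


# Hypothetical sample data, not taken from either network.
diameters = [6, 8, 6, 12, 8, 6, 6]
flows = [1, 2, 2, 3, 3, 4, 5, 6, 8, 100]

print("unique diameters:", unique_values(diameters))
print("quantile breaks:", quantile_breaks(flows, 5))
print("percentile breaks:", percentile_breaks(flows, [0, 50, 90, 100]))
```

Unlike equal intervals, the quantile breaks follow the data: for the skewed sample above, most of the breaks fall among the small flows rather than in the empty range below the single large value.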

Quick note: For consistency in the comparison, all data were classified into 5 groups, except for the unique distribution. The goal of this study is not to generate a beautiful color-coded network but to find better groups (illustrated by colors) to represent the network's data.

Conclusion: There is no clear winner, unfortunately. Depending on the data, a certain type of distribution may represent it better. Unique is suited for attributes like diameter, C-factor, material, etc. Jenks might fit most cases.
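Since Jenks comes out as the likely general-purpose choice, here is a minimal sketch of how natural breaks can be computed. It is a plain dynamic-programming version that minimizes the total within-class squared deviation. It is not the code used in the study, the function name is hypothetical, and it is written for clarity rather than speed, so a dedicated library would likely be a better fit for large networks.

```python
def jenks_breaks(values, n_classes):
    """Return n_classes + 1 values: the lower bound of each class followed by
    the maximum, chosen to minimize the total within-class sum of squared
    deviations from the class mean."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in constant time.
    prefix = [0.0]
    prefix_sq = [0.0]
    for v in data:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def ssd(i, j):
        """Sum of squared deviations from the mean for data[i:j]."""
        s = prefix[j] - prefix[i]
        sq = prefix_sq[j] - prefix_sq[i]
        return sq - s * s / (j - i)

    inf = float("inf")
    # cost[k][j]: best cost of splitting the first j values into k classes.
    # split[k][j]: index where the last of those k classes starts.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + ssd(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    split[k][j] = i

    # Walk back through the stored split points to recover the boundaries.
    breaks = [data[-1]]
    j = n
    for k in range(n_classes, 0, -1):
        i = split[k][j]
        breaks.append(data[i])
        j = i
    return breaks[::-1]


# Hypothetical sample with three obvious clusters.
sample = [1, 2, 3, 10, 11, 12, 50, 51, 52]
print(jenks_breaks(sample, 3))
```

For the sample above, the result is [1, 10, 50, 52]: each cluster becomes its own class, which is the behavior that makes Jenks a reasonable default when the data has natural gaps.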



Code for NetworkA, with some description of the process.
Code for NetworkB, with some description of the process.




To keep the post short, below are just the images from a lengthier document.

Network A:
Let’s look at the diameter first. The data in this network (NetworkA) is heavily concentrated around one value, as can be seen in the histogram.

This is how the network looks with the different data classifications. None of these “look” great because the data is heavily skewed, with almost all diameters falling into one group.

Flows:

Max Flow

Average Flow

Min Flow

Velocity:

Velocity Max


Velocity Average

Velocity Min

Headloss:

Headloss Max

Headloss Average

Headloss Min

Network B:

Diameter

Flows:

Max Flow

Average Flow

Min Flow

Velocity:

Velocity Max


Velocity Average

Velocity Min

Headloss:

Headloss Max

Headloss Average

Headloss Min