Companies use different names for the same technology, and it can be confusing. For example:
- "Microsoft Azure" and "Azure" are the same thing.
I want to group these different names into clusters that represent the same skill or technology.
- I used clustering methods to group similar names.
- I made dendrograms to see how agglomerative clustering works.
- I tested some other methods to figure out what works best.
- Agglomerative clustering builds clusters step by step (small groups first, then bigger ones).
- I tested different thresholds to find the best grouping.
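
To make the threshold idea concrete, here is a minimal sketch using only the standard library. It is not the code from this project: the distance measure (`difflib` string similarity), the single-linkage rule, the function names `name_distance` and `cluster_at_threshold`, and the sample names are all assumptions chosen for illustration. Distances here fall between 0 and 1, so the thresholds are on a different scale than the `t_1.0` and `t_3.0` values mentioned in these notes.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_distance(a: str, b: str) -> float:
    """Distance in [0, 1]: 0 for identical names, 1 for nothing in common."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_at_threshold(names: list[str], threshold: float) -> list[set[str]]:
    """Flat clusters from single-linkage agglomerative clustering cut at `threshold`.

    Cutting a single-linkage merge tree at height t gives the connected
    components of the graph that links every pair with distance <= t,
    so a union-find over those pairs is enough for this sketch.
    """
    parent = {name: name for name in names}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if name_distance(a, b) <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    clusters: dict[str, set[str]] = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


# Hypothetical sample data, not taken from the real dataset.
names = ["Azure", "Microsoft Azure", "AWS", "Amazon Web Services", "PostgreSQL", "Postgres"]
for t in (0.2, 0.5, 0.8):
    print(t, cluster_at_threshold(names, t))
```

A low threshold leaves most names in their own cluster, while a high one merges unrelated names, which is the same trade-off described above: no single threshold is guaranteed to suit every cluster.
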
Examples:
- At threshold 1 (`t_1.0`), the cluster for `Azure` is too small:
- At threshold 3 (`t_3.0`), the `Azure` cluster is better:
- But I can't use threshold 3 everywhere because, for example, at threshold 3:
- DBSCAN can handle clusters of different sizes.
- But it has problems with clusters that have different densities (spread out vs. compact).
- I used this because DBSCAN wasn’t perfect.
- Still, there is more work needed.
- Cluster Search:
- I wrote a `trace_specific_cluster` function, but it could be faster. It recomputes the clusters on every call, when it should compute them only once.
- My Knowledge:
- I don’t know all the technologies, so some clusters might not be super accurate.
- Make the code run faster.
- Fix the problems with clusters for different densities.
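
One possible way to make the cluster search compute its clusters only once is to cache the expensive step. The sketch below is an assumption-heavy illustration: the real signature of `trace_specific_cluster` is not shown in these notes, so the parameters here are hypothetical, and `compute_clusters` is a placeholder that groups names by their last word instead of running the real clustering.

```python
from functools import lru_cache

CALLS = {"count": 0}  # only here to show how often the expensive step runs


@lru_cache(maxsize=None)
def compute_clusters(names: tuple[str, ...], threshold: float) -> dict[str, int]:
    """Stand-in for the expensive clustering step: maps each name to a cluster id.

    Placeholder logic (NOT the project's method): names that share their last
    word get the same id. `threshold` is unused here and only forms part of
    the cache key, so each (names, threshold) pair is computed once.
    """
    CALLS["count"] += 1
    ids: dict[str, int] = {}
    labels: dict[str, int] = {}
    for name in names:
        key = name.lower().split()[-1]
        labels[name] = ids.setdefault(key, len(ids))
    return labels


def trace_specific_cluster(names: tuple[str, ...], threshold: float, query: str) -> list[str]:
    """Return every name in the same cluster as `query`, reusing cached labels."""
    labels = compute_clusters(names, threshold)
    return [n for n in names if labels[n] == labels[query]]


# Hypothetical sample data; a tuple is used because cache keys must be hashable.
names = ("Azure", "Microsoft Azure", "AWS", "Postgres")
print(trace_specific_cluster(names, 3.0, "Azure"))
print(trace_specific_cluster(names, 3.0, "AWS"))
print(CALLS["count"])  # the clustering step ran once for both lookups
```

The cached dictionary is shared between calls, so callers should treat it as read-only. Computing the labels once up front and passing them into the search function would work just as well if a cache feels too implicit.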


