@avery-cho-ai avery-cho-ai commented Sep 22, 2023

We have the following redundant fields:

  • hierarchy_camel and hierarchy
  • hierarchy_radio_camel and hierarchy_radio
  • content_camel and content

Although these fields do not hurt search performance, they do inflate our record and index sizes. Removing them appears to change search behavior slightly, but not substantially.
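As a rough illustration of the size impact, here is a minimal sketch that drops the three `*_camel` fields from a record and compares the JSON-encoded size before and after. The field names come from the list above; the sample record, the helper names, and the use of compact JSON byte length as a size estimate are assumptions for illustration, not the scraper's actual code or the search provider's size accounting.

```python
import json

# Field names taken from the list of redundant fields above.
REDUNDANT_FIELDS = ("hierarchy_camel", "hierarchy_radio_camel", "content_camel")


def strip_redundant_fields(record: dict) -> dict:
    """Return a copy of the record without the duplicated *_camel fields."""
    return {k: v for k, v in record.items() if k not in REDUNDANT_FIELDS}


def record_size(record: dict) -> int:
    """Approximate record size as the byte length of its compact JSON encoding."""
    encoded = json.dumps(record, separators=(",", ":"), ensure_ascii=False)
    return len(encoded.encode("utf-8"))


# Hypothetical record, for illustration only.
sample = {
    "objectID": "abc",
    "hierarchy": {"lvl0": "Docs"},
    "hierarchy_camel": {"lvl0": "Docs"},
    "hierarchy_radio": {"lvl0": "Docs"},
    "hierarchy_radio_camel": {"lvl0": "Docs"},
    "content": "Install the agent.",
    "content_camel": "Install the agent.",
}

slim = strip_redundant_fields(sample)
print(record_size(sample), record_size(slim))
```

Running this over a real export of the index would give a rough estimate of the savings per record.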

I tested for potential impact of this change in 3 ways:

  • Manually testing popular search queries on a new index built with the new scraper against the existing production v1.42 index shows no real difference. The only observable ranking change is that results are sometimes swapped when they rank identically and the only tie-breaker is their objectID.
  • I also ran the existing search tests, and the score was very close to the score from the same tests run against prod. The overall score increased from 56.6421 to 57.3087 (+0.6666), which is entirely due to one product: the score for the Simian product changed from 9.25/30 to 9.9166/30 (also +0.6666). The cause of this change is unknown, but it could come from the change in result order due to different objectIDs.
  • I created two smaller indices (by adjusting the start and stop URLs), one from this new commit and one from the previous one. They appear to contain the same 60 records and perform identically. When building the full indices with each version of the scraper, the record counts do not match, but this seems to be unrelated to this change, which is why I used two small test indices instead.
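
The comparison of the two small indices could be automated along these lines. This is only a sketch: it assumes both indices have been exported as lists of dicts, and it treats records as equal when they match after ignoring `objectID` and the removed `*_camel` fields. `canonical` and `diff_indices` are hypothetical helper names, not part of the scraper.

```python
import json
from collections import Counter

# objectID differs between builds; the *_camel fields only exist in the old index.
IGNORED_FIELDS = ("objectID", "hierarchy_camel", "hierarchy_radio_camel", "content_camel")


def canonical(record: dict, ignore=IGNORED_FIELDS) -> str:
    """Serialize a record with ignored fields removed and keys sorted."""
    kept = {k: v for k, v in record.items() if k not in ignore}
    return json.dumps(kept, sort_keys=True)


def diff_indices(old_records, new_records, ignore=IGNORED_FIELDS):
    """Return (only_in_old, only_in_new) as lists of canonical record strings."""
    old = Counter(canonical(r, ignore) for r in old_records)
    new = Counter(canonical(r, ignore) for r in new_records)
    # Counter subtraction keeps only positive counts, so duplicates are respected.
    return list((old - new).elements()), list((new - old).elements())
```

Two empty lists mean the indices hold the same records apart from the ignored fields, regardless of record order.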

@avery-cho-ai avery-cho-ai marked this pull request as ready for review September 22, 2023 22:24