bug: whois query data is formatted differently for some tlds #25

@mackcoding

Description

While testing, I found that the whois output for some TLDs is formatted differently. In general, we parse name: value pairs that sit on a single line. In some outputs, though, the value follows the name after a line break. Here's an example output:

Domain: google.de
Status: connect


    Domain name:
        bbc.co.uk

    Data validation:
        Nominet was able to match the registrant's name and address against a 3rd party data source on 12-Jun-2014

    Registrar:
        British Broadcasting Corporation [Tag = BBC]
        URL: https://www.bbc.co.uk

    Relevant dates:
        Registered on: before Aug-1996
        Expiry date:  13-Dec-2025
        Last updated:  10-Dec-2020

    Registration status:
        Registered until expiry date.

    Name servers:
        ddns0.bbc.co.uk           148.163.199.1  2607:f740:e04e::1
        ddns0.bbc.com
        ddns1.bbc.co.uk           148.163.199.65  2607:f740:e04e:4::1
        ddns1.bbc.com
        dns0.bbc.co.uk            198.51.44.9  2620:4d:4000:6259:7:9:0:1
        dns0.bbc.com
        dns1.bbc.co.uk            198.51.45.9  2a00:edc0:6259:7:9::2
        dns1.bbc.com

    WHOIS lookup made at 06:22:57 18-Oct-2024

The tokenizer needs to be updated to detect this format. From what I can tell:

  1. The first line of an item is the name, ending in a colon (name:).
  2. The following indented line(s) hold the data.
  3. A blank line marks the start of a new item.
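
As a rough illustration of the three rules above, here is a minimal Python sketch. The function name `tokenize_blocks` and the dict-of-lists return shape are assumptions made for this example, not the project's actual tokenizer API. It treats a line ending in a colon as a name only when no item is open, so data lines such as "URL: https://www.bbc.co.uk" stay attached to their item.

```python
def tokenize_blocks(text):
    """Group each 'Name:' line with the data lines that follow it."""
    items = {}
    name = None  # name of the item currently being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Rule 3: a blank line ends the current item.
            name = None
        elif name is None and line.endswith(":"):
            # Rule 1: the first line of an item is its name.
            name = line[:-1]
            items[name] = []
        elif name is not None:
            # Rule 2: every following line is data for that item.
            items[name].append(line)
        # Any other line (e.g. the "WHOIS lookup made at" footer) is skipped.
    return items


sample = """
    Domain name:
        bbc.co.uk

    Registrar:
        British Broadcasting Corporation [Tag = BBC]
        URL: https://www.bbc.co.uk

    WHOIS lookup made at 06:22:57 18-Oct-2024
"""
result = tokenize_blocks(sample)
```

Note that single-line pairs such as "Domain: google.de" are skipped by this sketch, so it would have to run alongside the existing name:value parsing rather than replace it.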

Other outputs do not comply with that rule set:

Domain:             google.it
Status:             ok
Signed:             no
Created:            1999-12-10 00:00:00
Last Update:        2024-09-27 00:50:20
Expire Date:        2025-04-21

Registrant
  Organization:     Google Ireland Holdings Unlimited Company
  Address:          70 Sir John Rogerson's Quay
                    Dublin
                    2
                    Dublin
                    IE
  Created:          2018-03-02 19:04:02
  Last Update:      2018-03-02 19:04:02

Admin Contact
  Name:             Colm Buckley
  Organization:     Google LLC
  Address:          1600 Amphitheatre Parkway
                    Mountain View
                    94043
                    CA
                    US
  Created:          2024-09-27 00:44:25
  Last Update:      2024-09-27 00:44:25

Technical Contacts
  Name:             Domain Administrator
  Organization:     Google LLC
  Address:          1600 Amphitheatre Parkway
                    Mountain View
                    94043
                    CA
                    US
  Created:          2017-12-21 19:54:04
  Last Update:      2017-12-21 19:54:04

Registrar
  Organization:     MarkMonitor International Limited
  Name:             MARKMONITOR-REG
  Web:              https://www.markmonitor.com/
  DNSSEC:           no


Nameservers
  ns1.google.com
  ns2.google.com
  ns3.google.com
  ns4.google.com

This is similar to the example above, but here each section has a header (such as Technical Contacts), and some values, such as Address, span multiple lines.
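
One possible way to handle the sectioned layout is sketched below in Python. The names `FIELD` and `tokenize_sections`, and the nested-dict result, are hypothetical and chosen only for illustration. The sketch assumes that an unindented line without a colon is a section header, that an indented line without a "name: " prefix continues the previous value, and that indented lines with no preceding field (the nameserver list) are bare values stored under an empty key.

```python
import re

# "Name:" optionally followed by whitespace and a value. Requiring whitespace
# after the colon keeps "https://..." or "00:50:20" from looking like a field.
FIELD = re.compile(r"^([^:]+):(?:\s+(.*))?$")


def tokenize_sections(text):
    """Parse section headers, 'name: value' fields, and continuation lines."""
    sections = {"": {}}      # top-level fields live under the empty section
    current = sections[""]
    values = None            # value list of the most recent field, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            values = None    # a blank line ends any multi-line value
            continue
        indented = raw[0].isspace()
        match = FIELD.match(line)
        if match:
            if not indented:
                current = sections[""]   # unindented field: top level
            values = current.setdefault(match.group(1), [])
            if match.group(2):
                values.append(match.group(2))
        elif not indented:
            # Unindented line without a colon: a section header.
            current = sections.setdefault(line, {})
            values = None
        elif values is not None:
            values.append(line)          # continuation, e.g. Address lines
        else:
            current.setdefault("", []).append(line)  # bare value, e.g. ns1
    return sections


sample = """\
Domain:             google.it
Status:             ok

Registrant
  Organization:     Google Ireland Holdings Unlimited Company
  Address:          70 Sir John Rogerson's Quay
                    Dublin
                    IE

Nameservers
  ns1.google.com
  ns2.google.com
"""
parsed = tokenize_sections(sample)
```

A known limitation of this sketch: a continuation line that itself contains a colon followed by a space would be misread as a new field, so comparing indentation depth against the field name may be a more robust signal.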

We need to add cases to the tokenizer to handle these formats. Attached is a list of whois queries for 100 different TLDs.

whois.txt

Metadata

Labels: bug (Something isn't working)
Status: In Progress