diff --git a/README.md b/README.md
index a8a66cd..4b263ad 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,9 @@ uses perform various tasks related to the Grove platform.
   database in Atlas.
 - `dodec`, or the Database of Devoured Example Code: a query tool that lets us find code examples and related metadata in the database for reporting or to perform manual updates.
+- `create-url-list`: A Go CLI tool that extracts and ranks URLs by pageviews from CSV data containing page analytics.
 - `dependency-manager`: A Go CLI project to help us manage dependencies for multiple ecosystems in the docs monorepo
+- `get-docs-markdown`: A Go CLI tool that downloads the markdown versions of documentation pages from an input CSV file.
 - `github-metrics`: a Node.js script that gets engagement metrics from GitHub for specified repos and writes them to a database in Atlas.
 - `query-docs-feedback`: a Go project with type definitions that queries the MongoDB
diff --git a/get-docs-markdown/.gitignore b/get-docs-markdown/.gitignore
new file mode 100644
index 0000000..106d2e1
--- /dev/null
+++ b/get-docs-markdown/.gitignore
@@ -0,0 +1 @@
+get-docs-markdown
diff --git a/get-docs-markdown/README.md b/get-docs-markdown/README.md
new file mode 100644
index 0000000..c895def
--- /dev/null
+++ b/get-docs-markdown/README.md
@@ -0,0 +1,122 @@
+# get-docs-markdown
+
+A Go utility that downloads the markdown versions of MongoDB documentation pages listed in a CSV file.
+
+## Overview
+
+This tool reads a CSV file containing MongoDB documentation URLs (typically output from `create-url-list`) and downloads the markdown version of each page.
+
+## Usage
+
+```bash
+./get-docs-markdown -csv <csv-file> -output <output-dir> [options]
+```
+
+### Flags
+
+- `-csv`: (Required) Path to the CSV file containing URLs
+- `-output`: (Optional) Output directory for markdown files (default: `markdown-output`)
+- `-workers`: (Optional) Number of concurrent download workers (default: `10`)
+- `-rate-limit`: (Optional) Maximum requests per second (default: `5.0`, use `0` for unlimited)
+
+### Examples
+
+```bash
+# Build the tool
+go build
+
+# Download markdown files with default settings (10 workers, 5 req/sec)
+./get-docs-markdown -csv /path/to/top-250-dec-2025.csv -output ./markdown-files
+
+# Use more workers and a higher rate limit for faster downloads
+./get-docs-markdown -csv /path/to/top-250-dec-2025.csv -output ./markdown-files -workers 20 -rate-limit 10
+
+# Conservative settings to reduce server load
+./get-docs-markdown -csv /path/to/top-250-dec-2025.csv -output ./markdown-files -workers 5 -rate-limit 2
+```
+
+## CSV Format
+
+The tool expects a CSV file with the following format:
+
+```
+Rank,Page,Number of Page Views
+1,www.mongodb.com/docs/manual/administration/install-community/,55197
+2,www.mongodb.com/docs/get-started/,45669
+```
+
+Input CSVs may include or omit the header row. If a header row is present, the tool skips it.
+
+The tool reads the URL from the second column (index 1).
+
+## How It Works
+
+1. **CSV Reading**: Reads all URLs from the CSV file into memory
+
+2. **Concurrent Processing**: Spawns multiple worker goroutines (default: 10) to download files in parallel
+
+3. **Rate Limiting**: Uses a token bucket algorithm to limit requests per second (default: 5 req/sec)
+   - Prevents overwhelming the server
+   - Ensures respectful crawling behavior
+
+4. **URL Processing**: For each URL:
+   - Removes trailing slashes
+   - Removes query parameters and anchors
+   - Adds a `.md` extension to get the markdown version
+   - Adds a User-Agent header to avoid 503 errors
+
+5. **Slug Extraction**: Extracts the page slug from the URL (everything after `www.mongodb.com/docs/`)
+   - Includes language and version prefixes to ensure uniqueness
+   - Examples:
+     - `www.mongodb.com/docs/manual/installation/` → `manual/installation`
+     - `www.mongodb.com/zh-cn/docs/manual/installation/` → `zh-cn/manual/installation`
+     - `www.mongodb.com/docs/v7.0/manual/installation/` → `v7.0/manual/installation`
+
+6. **File Naming**: Saves files as `<output-dir>/<slug>.md`
+   - Preserves the directory structure from the URL path, including language/version prefixes
+   - Skips the download if the file already exists
+   - Examples: `manual/installation.md`, `zh-cn/manual/installation.md`, `v7.0/manual/installation.md`
+
+7. **Download**: Downloads the markdown content and saves it to the output directory
+
+## Output
+
+The tool creates a directory structure matching the URL paths, including language and version prefixes:
+
+```
+markdown-output/
+├── manual/
+│   ├── administration/
+│   │   └── install-community.md
+│   └── reference/
+│       └── connection-string.md
+├── zh-cn/
+│   └── manual/
+│       └── administration/
+│           └── install-community.md
+├── v7.0/
+│   └── administration/
+│       └── install-community.md
+├── get-started.md
+├── mongodb-shell/
+│   └── install.md
+└── compass/
+    └── install.md
+```
+
+This ensures that different language versions and versioned documentation are saved separately without conflicts.
+
+## Error Handling
+
+- If a URL cannot be downloaded (404, network error, etc.), the tool logs the error and continues with the next URL
+- At the end, it reports the number of successful downloads and errors
+
+## Performance
+
+With default settings (10 workers, 5 req/sec):
+
+- **250 URLs**: ~50 seconds
+- **500 URLs**: ~100 seconds (1.7 minutes)
+- **1000 URLs**: ~200 seconds (3.3 minutes)
+
+You can adjust `-workers` and `-rate-limit` to balance speed against server load. Higher values download faster but risk rate limiting or server errors.
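Reviewer note on the concurrency model described in the README's How It Works section: a minimal, self-contained sketch of a worker pool sharing one token-bucket limiter, using `golang.org/x/time/rate` as the tool does. The `fetch` helper, the worker count, and the hard-coded URL list are illustrative stand-ins, not part of the tool itself.

```go
package main

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// fetch is an illustrative stand-in for the real markdown download.
func fetch(url string) error {
	fmt.Println("GET", url)
	return nil
}

func main() {
	urls := []string{"manual/installation", "atlas/getting-started", "compass/install"}

	// 5 tokens per second, burst of 1 — mirrors the tool's defaults in spirit.
	limiter := rate.NewLimiter(rate.Limit(5), 1)
	jobs := make(chan string, len(urls))

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ { // a small worker pool sharing one limiter
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				// Every request blocks until a token is available, so the
				// pool as a whole never exceeds the configured rate.
				if err := limiter.Wait(context.Background()); err != nil {
					return
				}
				fetch(u)
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
}
```

Because all workers wait on the same limiter, raising `-workers` increases parallelism without raising the request rate; only `-rate-limit` changes how fast tokens refill.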
+ diff --git a/get-docs-markdown/go.mod b/get-docs-markdown/go.mod new file mode 100644 index 0000000..6d0b46e --- /dev/null +++ b/get-docs-markdown/go.mod @@ -0,0 +1,5 @@ +module get-docs-markdown + +go 1.25.4 + +require golang.org/x/time v0.14.0 // indirect diff --git a/get-docs-markdown/go.sum b/get-docs-markdown/go.sum new file mode 100644 index 0000000..ed7d3d5 --- /dev/null +++ b/get-docs-markdown/go.sum @@ -0,0 +1,2 @@ +golang.org/x/time v0.14.0 h1:MRx4UaLrDotUKUdCIqzPC48t1Y9hANFKIRpNx+Te8PI= +golang.org/x/time v0.14.0/go.mod h1:eL/Oa2bBBK0TkX57Fyni+NgnyQQN4LitPmob2Hjnqw4= diff --git a/get-docs-markdown/main.go b/get-docs-markdown/main.go new file mode 100644 index 0000000..17b99c5 --- /dev/null +++ b/get-docs-markdown/main.go @@ -0,0 +1,367 @@ +package main + +import ( + "context" + "encoding/csv" + "flag" + "fmt" + "io" + "net/http" + "net/url" + "os" + "path/filepath" + "strings" + "sync" + "time" + + "golang.org/x/time/rate" +) + +func main() { + // Define command-line flags + csvFile := flag.String("csv", "", "Path to CSV file containing URLs") + outputDir := flag.String("output", "markdown-output", "Output directory for markdown files") + workers := flag.Int("workers", 10, "Number of concurrent download workers") + rateLimit := flag.Float64("rate-limit", 5.0, "Maximum requests per second (0 for unlimited)") + flag.Parse() + + if *csvFile == "" { + fmt.Println("Error: -csv flag is required") + flag.Usage() + os.Exit(1) + } + + // Create output directory if it doesn't exist + if err := os.MkdirAll(*outputDir, 0755); err != nil { + fmt.Printf("Error creating output directory: %v\n", err) + os.Exit(1) + } + + // Create rate limiter + var limiter *rate.Limiter + if *rateLimit > 0 { + limiter = rate.NewLimiter(rate.Limit(*rateLimit), 1) + fmt.Printf("Rate limiting enabled: %.1f requests/second\n", *rateLimit) + } else { + limiter = rate.NewLimiter(rate.Inf, 0) + fmt.Println("Rate limiting disabled") + } + + // Read and process CSV file + if err := processCSV(*csvFile, *outputDir, *workers, limiter); err != nil { + fmt.Printf("Error processing CSV: %v\n", err) + os.Exit(1) + } + + fmt.Println("\nDownload complete!") +} + +// downloadJob represents a download task +type downloadJob struct { + url string + lineNum int +} + +// downloadResult represents the result of a download +type downloadResult struct { + url string + lineNum int + err error +} + +// isHeaderRow detects if a CSV row is a header row +func isHeaderRow(record []string) bool { + if len(record) < 2 { + return false + } + + // Check for common header patterns from create-url-list output + // Output headers: "Rank", "Page", "Number of Page Views" + // Input headers: "Page", "Page Subsite", "Measure Names", "Measure Values", "Min. 
Aux" + + // Check first and second columns + firstCol := strings.ToLower(strings.TrimSpace(record[0])) + secondCol := strings.ToLower(strings.TrimSpace(record[1])) + + // If either column looks like a URL (contains mongodb.com or www.), it's not a header + if strings.Contains(firstCol, "mongodb.com") || strings.Contains(firstCol, "www.") { + return false + } + if strings.Contains(secondCol, "mongodb.com") || strings.Contains(secondCol, "www.") { + return false + } + + // Common header patterns in first column + firstColKeywords := []string{"rank", "page"} + for _, keyword := range firstColKeywords { + if firstCol == keyword { + return true + } + } + + // Common header patterns in second column + secondColKeywords := []string{"page", "measure", "url"} + for _, keyword := range secondColKeywords { + if strings.Contains(secondCol, keyword) { + return true + } + } + + // If the second column doesn't contain a dot, it's likely not a URL + if !strings.Contains(secondCol, ".") { + return true + } + + return false +} + +// processCSV reads the CSV file and downloads markdown for each URL using concurrent workers +func processCSV(csvPath, outputDir string, numWorkers int, limiter *rate.Limiter) error { + file, err := os.Open(csvPath) + if err != nil { + return fmt.Errorf("failed to open CSV file: %w", err) + } + defer file.Close() + + // Read all URLs from CSV first + reader := csv.NewReader(file) + var jobs []downloadJob + lineNum := 0 + + for { + record, err := reader.Read() + if err == io.EOF { + break + } + if err != nil { + return fmt.Errorf("error reading CSV at line %d: %w", lineNum, err) + } + + lineNum++ + + // Skip header row if present + // Common headers: "Rank,Page" or "Rank,Page,Number of Page Views" or "Page,Page Subsite,..." + if lineNum == 1 && isHeaderRow(record) { + fmt.Println("Detected and skipping header row") + continue + } + + // Expect format: rank,url,metric (or just url,metric) + if len(record) < 2 { + fmt.Printf("Warning: Line %d has insufficient columns, skipping\n", lineNum) + continue + } + + jobs = append(jobs, downloadJob{ + url: record[1], + lineNum: lineNum, + }) + } + + if len(jobs) == 0 { + fmt.Println("No URLs to process") + return nil + } + + fmt.Printf("Processing %d URLs with %d workers...\n", len(jobs), numWorkers) + + // Create channels for jobs and results + jobChan := make(chan downloadJob, len(jobs)) + resultChan := make(chan downloadResult, len(jobs)) + + // Start worker goroutines + var wg sync.WaitGroup + ctx := context.Background() + for i := 0; i < numWorkers; i++ { + wg.Add(1) + go worker(ctx, i+1, jobChan, resultChan, outputDir, limiter, &wg) + } + + // Send jobs to workers + for _, job := range jobs { + jobChan <- job + } + close(jobChan) + + // Wait for all workers to finish + go func() { + wg.Wait() + close(resultChan) + }() + + // Collect results + successCount := 0 + errorCount := 0 + for result := range resultChan { + if result.err != nil { + fmt.Printf("Error downloading %s: %v\n", result.url, result.err) + errorCount++ + } else { + successCount++ + } + } + + fmt.Printf("\nProcessed %d URLs: %d successful, %d errors\n", len(jobs), successCount, errorCount) + return nil +} + +// worker processes download jobs from the job channel +func worker(ctx context.Context, id int, jobs <-chan downloadJob, results chan<- downloadResult, outputDir string, limiter *rate.Limiter, wg *sync.WaitGroup) { + defer wg.Done() + + for job := range jobs { + // Wait for rate limiter before making request + if err := limiter.Wait(ctx); err != nil { + results <- 
downloadResult{ + url: job.url, + lineNum: job.lineNum, + err: fmt.Errorf("rate limiter error: %w", err), + } + continue + } + + err := downloadMarkdown(job.url, outputDir) + results <- downloadResult{ + url: job.url, + lineNum: job.lineNum, + err: err, + } + } +} + +// downloadMarkdown downloads the markdown version of a documentation page +func downloadMarkdown(pageURL, outputDir string) error { + // Extract the page slug and construct markdown URL + slug, err := extractPageSlug(pageURL) + if err != nil { + return fmt.Errorf("failed to extract page slug: %w", err) + } + + // Check if file already exists + outputPath := filepath.Join(outputDir, slug+".md") + if _, err := os.Stat(outputPath); err == nil { + // File already exists, skip download + return nil + } + + mdURL, err := constructMarkdownURL(pageURL) + if err != nil { + return fmt.Errorf("failed to construct markdown URL: %w", err) + } + + // Create HTTP request with User-Agent header + req, err := http.NewRequest("GET", mdURL, nil) + if err != nil { + return fmt.Errorf("failed to create request: %w", err) + } + req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; get-docs-markdown/1.0)") + + // Download the markdown content + client := &http.Client{ + Timeout: 30 * time.Second, + } + resp, err := client.Do(req) + if err != nil { + return fmt.Errorf("failed to download: %w", err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + return fmt.Errorf("HTTP error: %d %s", resp.StatusCode, resp.Status) + } + + // Create subdirectories if needed + if err := os.MkdirAll(filepath.Dir(outputPath), 0755); err != nil { + return fmt.Errorf("failed to create subdirectories: %w", err) + } + + outFile, err := os.Create(outputPath) + if err != nil { + return fmt.Errorf("failed to create output file: %w", err) + } + defer outFile.Close() + + // Write content to file + if _, err := io.Copy(outFile, resp.Body); err != nil { + return fmt.Errorf("failed to write file: %w", err) + } + + return nil +} + +// extractPageSlug extracts the page slug from a URL (everything after www.mongodb.com/docs/) +// Includes language and version prefixes to ensure uniqueness +func extractPageSlug(pageURL string) (string, error) { + // Remove protocol if present + pageURL = strings.TrimPrefix(pageURL, "http://") + pageURL = strings.TrimPrefix(pageURL, "https://") + + // Find the position of www.mongodb.com/ + domainIndex := strings.Index(pageURL, "www.mongodb.com/") + if domainIndex == -1 { + return "", fmt.Errorf("URL does not contain 'www.mongodb.com/': %s", pageURL) + } + + // Extract everything after www.mongodb.com/ + afterDomain := pageURL[domainIndex+16:] // +16 to skip "www.mongodb.com/" + + // Check if there's a language/version prefix before /docs/ + docsIndex := strings.Index(afterDomain, "docs/") + if docsIndex == -1 { + return "", fmt.Errorf("URL does not contain 'docs/': %s", pageURL) + } + + // Include any language/version prefix before docs/ + var slug string + if docsIndex > 0 { + // There's a prefix (e.g., "zh-cn/" or "v7.0/") + prefix := afterDomain[:docsIndex] + prefix = strings.Trim(prefix, "/") + afterDocs := afterDomain[docsIndex+5:] // +5 to skip "docs/" + slug = prefix + "/" + afterDocs + } else { + // No prefix, just extract after docs/ + slug = afterDomain[docsIndex+5:] // +5 to skip "docs/" + } + + // Remove query parameters and anchors first + if idx := strings.IndexAny(slug, "?#"); idx != -1 { + slug = slug[:idx] + } + + // Then remove trailing slash + slug = strings.TrimSuffix(slug, "/") + + // If slug is empty, use 
"index" + if slug == "" { + slug = "index" + } + + return slug, nil +} + +// constructMarkdownURL creates the markdown URL from a page URL +func constructMarkdownURL(pageURL string) (string, error) { + // Add https:// if not present + if !strings.HasPrefix(pageURL, "http://") && !strings.HasPrefix(pageURL, "https://") { + pageURL = "https://" + pageURL + } + + // Parse the URL + parsedURL, err := url.Parse(pageURL) + if err != nil { + return "", fmt.Errorf("failed to parse URL: %w", err) + } + + // Remove query parameters and fragment + parsedURL.RawQuery = "" + parsedURL.Fragment = "" + + // Remove trailing slash from path + parsedURL.Path = strings.TrimSuffix(parsedURL.Path, "/") + + // Add .md extension + parsedURL.Path += ".md" + + return parsedURL.String(), nil +} diff --git a/get-docs-markdown/main_test.go b/get-docs-markdown/main_test.go new file mode 100644 index 0000000..f2cb6f0 --- /dev/null +++ b/get-docs-markdown/main_test.go @@ -0,0 +1,307 @@ +package main + +import ( + "path/filepath" + "strings" + "testing" +) + +// TestExtractPageSlug tests the extractPageSlug function with various URL formats +func TestExtractPageSlug(t *testing.T) { + tests := []struct { + name string + url string + wantSlug string + wantError bool + }{ + { + name: "simple URL", + url: "www.mongodb.com/docs/manual/installation/", + wantSlug: "manual/installation", + wantError: false, + }, + { + name: "URL with trailing slash", + url: "www.mongodb.com/docs/atlas/getting-started/", + wantSlug: "atlas/getting-started", + wantError: false, + }, + { + name: "URL without trailing slash", + url: "www.mongodb.com/docs/compass/install", + wantSlug: "compass/install", + wantError: false, + }, + { + name: "URL with query params", + url: "www.mongodb.com/docs/manual/reference/?tab=cloud", + wantSlug: "manual/reference", + wantError: false, + }, + { + name: "URL with anchor", + url: "www.mongodb.com/docs/manual/reference/#section", + wantSlug: "manual/reference", + wantError: false, + }, + { + name: "URL with language prefix", + url: "www.mongodb.com/zh-cn/docs/manual/installation/", + wantSlug: "zh-cn/manual/installation", + wantError: false, + }, + { + name: "URL with version prefix", + url: "www.mongodb.com/docs/v7.0/administration/install-community/", + wantSlug: "v7.0/administration/install-community", + wantError: false, + }, + { + name: "URL with pt-br prefix", + url: "www.mongodb.com/pt-br/docs/manual/installation/", + wantSlug: "pt-br/manual/installation", + wantError: false, + }, + { + name: "URL with https protocol", + url: "https://www.mongodb.com/docs/manual/installation/", + wantSlug: "manual/installation", + wantError: false, + }, + { + name: "URL with http protocol", + url: "http://www.mongodb.com/docs/manual/installation/", + wantSlug: "manual/installation", + wantError: false, + }, + { + name: "URL without docs path", + url: "www.mongodb.com/products/", + wantSlug: "", + wantError: true, + }, + { + name: "URL without domain", + url: "/docs/manual/installation/", + wantSlug: "", + wantError: true, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + slug, err := extractPageSlug(tt.url) + if tt.wantError { + if err == nil { + t.Errorf("extractPageSlug(%q) expected error, got nil", tt.url) + } + } else { + if err != nil { + t.Errorf("extractPageSlug(%q) unexpected error: %v", tt.url, err) + } + if slug != tt.wantSlug { + t.Errorf("extractPageSlug(%q) = %q, want %q", tt.url, slug, tt.wantSlug) + } + } + }) + } +} + +// TestConstructMarkdownURL tests the constructMarkdownURL 
function +func TestConstructMarkdownURL(t *testing.T) { + tests := []struct { + name string + url string + wantURL string + }{ + { + name: "simple URL", + url: "www.mongodb.com/docs/manual/installation/", + wantURL: "https://www.mongodb.com/docs/manual/installation.md", + }, + { + name: "URL with query params", + url: "www.mongodb.com/docs/manual/reference/?tab=cloud", + wantURL: "https://www.mongodb.com/docs/manual/reference.md", + }, + { + name: "URL with anchor", + url: "www.mongodb.com/docs/manual/reference/#section", + wantURL: "https://www.mongodb.com/docs/manual/reference.md", + }, + { + name: "URL with language prefix", + url: "www.mongodb.com/zh-cn/docs/manual/installation/", + wantURL: "https://www.mongodb.com/zh-cn/docs/manual/installation.md", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + mdURL, err := constructMarkdownURL(tt.url) + if err != nil { + t.Errorf("constructMarkdownURL(%q) unexpected error: %v", tt.url, err) + } + if mdURL != tt.wantURL { + t.Errorf("constructMarkdownURL(%q) = %q, want %q", tt.url, mdURL, tt.wantURL) + } + }) + } +} + +// TestIsHeaderRow tests the isHeaderRow function +func TestIsHeaderRow(t *testing.T) { + tests := []struct { + name string + record []string + wantBool bool + }{ + { + name: "output header with rank and page", + record: []string{"Rank", "Page"}, + wantBool: true, + }, + { + name: "output header with pageviews", + record: []string{"Rank", "Page", "Number of Page Views"}, + wantBool: true, + }, + { + name: "input header from analytics", + record: []string{"Page", "Page Subsite", "Measure Names", "Measure Values", "Min. Aux"}, + wantBool: true, + }, + { + name: "data row with rank and URL", + record: []string{"1", "www.mongodb.com/docs/manual/installation/", "55197"}, + wantBool: false, + }, + { + name: "data row without rank", + record: []string{"www.mongodb.com/docs/manual/installation/", "55197"}, + wantBool: false, + }, + { + name: "header with URL keyword", + record: []string{"URL", "Pageviews"}, + wantBool: true, + }, + { + name: "single column", + record: []string{"Rank"}, + wantBool: false, + }, + { + name: "empty record", + record: []string{}, + wantBool: false, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + result := isHeaderRow(tt.record) + if result != tt.wantBool { + t.Errorf("isHeaderRow(%v) = %v, want %v", tt.record, result, tt.wantBool) + } + }) + } +} + +// TestProcessCSV_FileNotFound tests that processCSV returns an error for non-existent files +func TestProcessCSV_FileNotFound(t *testing.T) { + tmpDir := t.TempDir() + // Note: processCSV requires a rate limiter, but we're just testing file opening + // The function will fail before using the limiter + err := processCSV("testdata/nonexistent.csv", tmpDir, 1, nil) + if err == nil { + t.Error("processCSV() expected error for non-existent file, got nil") + } +} + +// TestProcessCSV_EmptyFile tests that processCSV handles empty CSV files +func TestProcessCSV_EmptyFile(t *testing.T) { + tmpDir := t.TempDir() + // Note: processCSV requires a rate limiter, but empty file will fail before using it + err := processCSV("testdata/empty.csv", tmpDir, 1, nil) + // Empty file should not cause an error, just process zero URLs + if err != nil && !strings.Contains(err.Error(), "EOF") { + t.Errorf("processCSV() unexpected error for empty file: %v", err) + } +} + +// Note: Full integration tests that call processCSV with actual downloads +// are skipped because they require network access and would be slow. 
+// The unit tests above cover the core logic without network dependencies. + +// TestSlugUniqueness tests that different language/version URLs produce unique slugs +func TestSlugUniqueness(t *testing.T) { + urls := []string{ + "www.mongodb.com/docs/manual/installation/", + "www.mongodb.com/zh-cn/docs/manual/installation/", + "www.mongodb.com/pt-br/docs/manual/installation/", + "www.mongodb.com/docs/v7.0/administration/install-community/", + } + + slugs := make(map[string]bool) + for _, url := range urls { + slug, err := extractPageSlug(url) + if err != nil { + t.Errorf("extractPageSlug(%q) unexpected error: %v", url, err) + continue + } + + if slugs[slug] { + t.Errorf("Duplicate slug %q for URL %q", slug, url) + } + slugs[slug] = true + } + + // Should have 4 unique slugs + if len(slugs) != 4 { + t.Errorf("Expected 4 unique slugs, got %d", len(slugs)) + } +} + +// TestFilePathGeneration tests that file paths are generated correctly +func TestFilePathGeneration(t *testing.T) { + tests := []struct { + name string + url string + outputDir string + expectedPath string + }{ + { + name: "simple path", + url: "www.mongodb.com/docs/manual/installation/", + outputDir: "/tmp/output", + expectedPath: "/tmp/output/manual/installation.md", + }, + { + name: "with language prefix", + url: "www.mongodb.com/zh-cn/docs/manual/installation/", + outputDir: "/tmp/output", + expectedPath: "/tmp/output/zh-cn/manual/installation.md", + }, + { + name: "with version prefix", + url: "www.mongodb.com/docs/v7.0/administration/install/", + outputDir: "/tmp/output", + expectedPath: "/tmp/output/v7.0/administration/install.md", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + slug, err := extractPageSlug(tt.url) + if err != nil { + t.Fatalf("extractPageSlug(%q) unexpected error: %v", tt.url, err) + } + + path := filepath.Join(tt.outputDir, slug+".md") + if path != tt.expectedPath { + t.Errorf("Expected path %q, got %q", tt.expectedPath, path) + } + }) + } +} diff --git a/get-docs-markdown/testdata/empty.csv b/get-docs-markdown/testdata/empty.csv new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/get-docs-markdown/testdata/empty.csv @@ -0,0 +1 @@ + diff --git a/get-docs-markdown/testdata/input-headers.csv b/get-docs-markdown/testdata/input-headers.csv new file mode 100644 index 0000000..8335a7f --- /dev/null +++ b/get-docs-markdown/testdata/input-headers.csv @@ -0,0 +1,4 @@ +Page,Page Subsite,Measure Names,Measure Values,Min. 
Aux +www.mongodb.com/docs/manual/installation/,Docs,Pageviews,55197,1 +www.mongodb.com/docs/atlas/getting-started/,Docs,Pageviews,45123,1 + diff --git a/get-docs-markdown/testdata/invalid-format.csv b/get-docs-markdown/testdata/invalid-format.csv new file mode 100644 index 0000000..27baf59 --- /dev/null +++ b/get-docs-markdown/testdata/invalid-format.csv @@ -0,0 +1,3 @@ +1 +2,www.mongodb.com/docs/manual/installation/ + diff --git a/get-docs-markdown/testdata/simple.csv b/get-docs-markdown/testdata/simple.csv new file mode 100644 index 0000000..4d6ac91 --- /dev/null +++ b/get-docs-markdown/testdata/simple.csv @@ -0,0 +1,3 @@ +1,www.mongodb.com/docs/manual/installation/,55197 +2,www.mongodb.com/docs/atlas/getting-started/,45123 + diff --git a/get-docs-markdown/testdata/with-headers.csv b/get-docs-markdown/testdata/with-headers.csv new file mode 100644 index 0000000..9a908b5 --- /dev/null +++ b/get-docs-markdown/testdata/with-headers.csv @@ -0,0 +1,5 @@ +Rank,Page,Number of Page Views +1,www.mongodb.com/docs/manual/installation/,55197 +2,www.mongodb.com/docs/atlas/getting-started/,45123 +3,www.mongodb.com/docs/drivers/node/current/,32456 + diff --git a/get-docs-markdown/testdata/with-language-versions.csv b/get-docs-markdown/testdata/with-language-versions.csv new file mode 100644 index 0000000..b740639 --- /dev/null +++ b/get-docs-markdown/testdata/with-language-versions.csv @@ -0,0 +1,5 @@ +1,www.mongodb.com/docs/manual/installation/,55197 +2,www.mongodb.com/zh-cn/docs/manual/installation/,6297 +3,www.mongodb.com/pt-br/docs/manual/installation/,1879 +4,www.mongodb.com/docs/v7.0/administration/install-community/,1444 +
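One note on the integration tests that `main_test.go` skips for lack of network access: `net/http/httptest` can serve canned responses from a local listener. The sketch below is a standalone illustration under that assumption — `fetchToFile` is a hypothetical helper defined in the snippet itself, since the tool's actual download path (`downloadMarkdown`) hard-codes mongodb.com URLs and would need a configurable base URL to be tested this way.

```go
package main

import (
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"testing"
)

// fetchToFile is a hypothetical helper mirroring the tool's download step:
// GET a URL and write the body to path, creating parent directories.
func fetchToFile(url, path string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
		return err
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func TestFetchToFileWithStubServer(t *testing.T) {
	// Serve a canned markdown body instead of hitting www.mongodb.com.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("# stub page\n"))
	}))
	defer srv.Close()

	out := filepath.Join(t.TempDir(), "manual", "installation.md")
	if err := fetchToFile(srv.URL+"/docs/manual/installation.md", out); err != nil {
		t.Fatalf("download failed: %v", err)
	}

	data, err := os.ReadFile(out)
	if err != nil {
		t.Fatalf("output file missing: %v", err)
	}
	if string(data) != "# stub page\n" {
		t.Errorf("unexpected content: %q", data)
	}
}
```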