Skip to content

stdreher/New_DocXLS-Converter

Repository files navigation

Doc2Xls Studio - Technical Documentation

Project Overview

Doc2Xls Studio is a web application that converts Word document tables into structured Excel files based on a user-defined template. It extracts tables, form fields, and metadata from DOCX files and maps them to target Excel column structures.

For user documentation, see English | German (Deutsch).

Architecture

Stack

  • Frontend: React 19 + TypeScript + Vite
  • Backend: Express.js + TypeScript
  • Styling: Tailwind CSS v4 + shadcn/ui components
  • Document Processing: mammoth.js (DOCX parsing) + xlsx (Excel generation)

Directory Structure

├── client/                  # Frontend React application
│   ├── src/
│   │   ├── components/      # UI components
│   │   │   ├── ui/          # shadcn/ui base components
│   │   │   ├── FileUploader.tsx      # DOCX file upload component
│   │   │   ├── TemplateUploader.tsx  # Excel template upload component
│   │   │   ├── ColumnMapper.tsx      # Column mapping interface
│   │   │   └── DataPreview.tsx       # Data preview and export component
│   │   ├── pages/
│   │   │   └── Converter.tsx  # Main converter page (orchestrates workflow)
│   │   ├── hooks/            # Custom React hooks
│   │   ├── lib/              # Utility functions
│   │   ├── App.tsx           # Root component with routing
│   │   └── index.css         # Global styles and CSS variables
│   └── index.html            # HTML entry point
├── server/                   # Backend Express server
│   ├── index.ts              # Server entry point
│   ├── routes.ts             # API routes and document processing logic
│   ├── storage.ts            # Storage interface (for future DB integration)
│   ├── static.ts             # Static file serving
│   └── vite.ts               # Vite dev server integration
├── shared/                   # Shared types and schemas
│   └── schema.ts             # Drizzle ORM schemas (for future use)
└── package.json              # Dependencies and scripts

Core Features

1. Template Upload (Step 1)

  • User uploads an Excel template (.xlsx/.xls)
  • System extracts column headers from the template
  • Supports complex templates with multiple sheets
  • Auto-detects the header row based on cell density

2. Document Upload (Step 2) - Batch Processing

  • User uploads one or more Word documents (.docx/.doc)
  • Batch mode: Upload multiple files at once for combined processing
  • System parses documents using mammoth.js
  • Extracts multiple table types:
    • Data tables: Traditional row/column format
    • Form tables: Label-value pair format (e.g., "Test Case Type: Funktionstest")
    • Document metadata: Filename, summary text
  • Groups columns by source table with prefixes like [Table Name] Column
  • In batch mode, displays file list with add/remove controls and processing status

3. Column Mapping (Step 3)

  • Visual interface to map source columns to target template columns
  • Auto-mapping based on column name similarity
  • Supports mapping from any source table to any target column

4. Preview & Export (Step 4)

  • Shows transformed data preview before download
  • Batch mode: Displays combined data from all documents with source file column
  • Generates single merged Excel file with data from all processed documents
  • Preserves column order from template

API Endpoints

POST /api/parse-template

Parses an Excel template and extracts column headers.

Request: multipart/form-data with file field Response:

{
  "columns": ["Column1", "Column2", ...],
  "sheetName": "Sheet Name",
  "headerRowIndex": 5,
  "fileName": "template.xlsx",
  "allSheets": ["Sheet1", "Sheet2", ...]
}

POST /api/parse-document

Parses a single Word document and extracts all tables and metadata.

Request: multipart/form-data with file field Response:

{
  "tables": [
    {
      "name": "Document Info",
      "columns": ["Dateiname", "Kurze Zusammenfassung"],
      "rows": [{ "Dateiname": "...", "Kurze Zusammenfassung": "..." }]
    },
    {
      "name": "Testfall-Erstellung",
      "columns": ["Test Case Type", "Version/Release", ...],
      "rows": [{ ... }]
    }
  ],
  "fileName": "document.docx",
  "tablesFound": 5,
  "freeTexts": ["Summary text..."],
  "headerInfo": "Document title"
}

POST /api/parse-documents-batch

Batch parses multiple Word documents (up to 50 files).

Request: multipart/form-data with files field (array) Response:

{
  "documents": [
    {
      "fileName": "doc1.docx",
      "tables": [...],
      "tablesFound": 5
    },
    {
      "fileName": "doc2.docx",
      "tables": [...],
      "tablesFound": 4
    }
  ],
  "totalFiles": 2,
  "processedFiles": 2
}

POST /api/generate-excel

Generates an Excel file from mapped data.

Request:

{
  "targetColumns": ["Target Col 1", "Target Col 2", ...],
  "mapping": { "Target Col 1": "[Source Table] Source Col", ... },
  "sourceData": [{ "[Source Table] Source Col": "value", ... }]
}

Response: Binary Excel file (.xlsx)

Document Parsing Logic

Table Detection

The parser handles two table formats:

  1. Form-Style Tables: Tables with label-value pairs

    • Detected by cells ending with ":"
    • Extracts each label as a column name, next cell as value
    • Example: | Test Case Type: | Funktionstest |
  2. Data Tables: Traditional tabular data

    • First non-title row treated as headers
    • Subsequent rows are data records
    • Example: Standard tables with header row

Metadata Extraction

  • Filename: Extracted from document header or paragraph containing ".docx"
  • Summary Text: Paragraphs following "Kurze Zusammenfassung" header
  • Free Text: Non-table content between sections

HTML Parsing

mammoth.js converts DOCX to HTML, then custom regex-based parsing extracts:

  • <table> elements with all nested <tr>, <td>, <th>
  • <p> paragraphs for free text
  • Cell content cleaned of HTML tags, normalized whitespace

Frontend State Management

Converter Page State

interface State {
  step: 'template' | 'upload' | 'map' | 'preview';
  file: File | null;
  isProcessing: boolean;
  parsedData: ParsedData | null;  // Source document data
  templateData: TemplateData | null;  // Target template structure
  mapping: Record<string, string>;  // target -> source column mapping
}

Data Flow

  1. handleTemplateSelect → Parse template → Store templateData
  2. handleFileSelect → Parse document → Store parsedData
  3. ColumnMapper → User maps columns → Update mapping
  4. handleDownload → Generate Excel with getCombinedData() + mapping

Combined Data Generation

The getCombinedData() function merges data from all tables:

  • Main table (most rows) provides base records
  • Other tables' first row data is added to each record
  • All columns prefixed with [Table Name] for unique identification

Styling System

Design Tokens (CSS Variables)

Defined in client/src/index.css:

  • Colors: Primary blue (221 83% 53%), slate backgrounds
  • Typography: Inter (body), Space Grotesk (headings)
  • Spacing: 0.25rem base unit
  • Border radius: 0.5rem default

Component Styling

  • shadcn/ui components for consistent UI
  • Tailwind utilities for layout and spacing
  • Framer Motion for animations and transitions

Dependencies

Production

  • mammoth: DOCX to HTML conversion
  • xlsx: Excel file reading/writing
  • multer: File upload handling
  • react-dropzone: Drag-and-drop file uploads
  • framer-motion: Animations
  • @radix-ui/*: Accessible UI primitives

Development

  • vite: Build tool and dev server
  • tsx: TypeScript execution
  • tailwindcss: CSS framework
  • typescript: Type checking

Running the Application

# Development
npm run dev

# Production build
npm run build
npm start

The application runs on port 5000 (configurable via PORT env variable).

Future Improvements

  1. Database Integration: Store templates and conversion history using Drizzle ORM
  2. Template Management: Save/load custom Excel templates
  3. Batch Processing: Convert multiple documents at once
  4. Advanced Mapping: Rule-based transformations and data validation
  5. Authentication: User accounts for saved preferences

Releases

No releases published

Packages

No packages published

Languages