Doc2Xls Studio - Technical Documentation

Project Overview

Doc2Xls Studio is a web application that converts Word document tables into structured Excel files based on a user-defined template. It extracts tables, form fields, and metadata from DOCX files and maps them to target Excel column structures.

For user documentation, see English | German (Deutsch).

Architecture

Stack

Frontend: React 19 + TypeScript + Vite
Backend: Express.js + TypeScript
Styling: Tailwind CSS v4 + shadcn/ui components
Document Processing: mammoth.js (DOCX parsing) + xlsx (Excel generation)

Directory Structure

├── client/                  # Frontend React application
│   ├── src/
│   │   ├── components/      # UI components
│   │   │   ├── ui/          # shadcn/ui base components
│   │   │   ├── FileUploader.tsx      # DOCX file upload component
│   │   │   ├── TemplateUploader.tsx  # Excel template upload component
│   │   │   ├── ColumnMapper.tsx      # Column mapping interface
│   │   │   └── DataPreview.tsx       # Data preview and export component
│   │   ├── pages/
│   │   │   └── Converter.tsx  # Main converter page (orchestrates workflow)
│   │   ├── hooks/            # Custom React hooks
│   │   ├── lib/              # Utility functions
│   │   ├── App.tsx           # Root component with routing
│   │   └── index.css         # Global styles and CSS variables
│   └── index.html            # HTML entry point
├── server/                   # Backend Express server
│   ├── index.ts              # Server entry point
│   ├── routes.ts             # API routes and document processing logic
│   ├── storage.ts            # Storage interface (for future DB integration)
│   ├── static.ts             # Static file serving
│   └── vite.ts               # Vite dev server integration
├── shared/                   # Shared types and schemas
│   └── schema.ts             # Drizzle ORM schemas (for future use)
└── package.json              # Dependencies and scripts

Core Features

1. Template Upload (Step 1)

User uploads an Excel template (.xlsx/.xls)
System extracts column headers from the template
Supports complex templates with multiple sheets
Auto-detects the header row based on cell density

2. Document Upload (Step 2) - Batch Processing

User uploads one or more Word documents (.docx/.doc)
Batch mode: Upload multiple files at once for combined processing
System parses documents using mammoth.js
Extracts multiple table types:
- Data tables: Traditional row/column format
- Form tables: Label-value pair format (e.g., "Test Case Type: Funktionstest")
- Document metadata: Filename, summary text
Groups columns by source table with prefixes like [Table Name] Column
In batch mode, displays file list with add/remove controls and processing status

3. Column Mapping (Step 3)

Visual interface to map source columns to target template columns
Auto-mapping based on column name similarity
Supports mapping from any source table to any target column

4. Preview & Export (Step 4)

Shows transformed data preview before download
Batch mode: Displays combined data from all documents with source file column
Generates single merged Excel file with data from all processed documents
Preserves column order from template

API Endpoints

POST `/api/parse-template`

Parses an Excel template and extracts column headers.

Request: multipart/form-data with file field Response:

{
  "columns": ["Column1", "Column2", ...],
  "sheetName": "Sheet Name",
  "headerRowIndex": 5,
  "fileName": "template.xlsx",
  "allSheets": ["Sheet1", "Sheet2", ...]
}

POST `/api/parse-document`

Parses a single Word document and extracts all tables and metadata.

Request: multipart/form-data with file field Response:

{
  "tables": [
    {
      "name": "Document Info",
      "columns": ["Dateiname", "Kurze Zusammenfassung"],
      "rows": [{ "Dateiname": "...", "Kurze Zusammenfassung": "..." }]
    },
    {
      "name": "Testfall-Erstellung",
      "columns": ["Test Case Type", "Version/Release", ...],
      "rows": [{ ... }]
    }
  ],
  "fileName": "document.docx",
  "tablesFound": 5,
  "freeTexts": ["Summary text..."],
  "headerInfo": "Document title"
}

POST `/api/parse-documents-batch`

Batch parses multiple Word documents (up to 50 files).

Request: multipart/form-data with files field (array) Response:

{
  "documents": [
    {
      "fileName": "doc1.docx",
      "tables": [...],
      "tablesFound": 5
    },
    {
      "fileName": "doc2.docx",
      "tables": [...],
      "tablesFound": 4
    }
  ],
  "totalFiles": 2,
  "processedFiles": 2
}

POST `/api/generate-excel`

Generates an Excel file from mapped data.

Request:

{
  "targetColumns": ["Target Col 1", "Target Col 2", ...],
  "mapping": { "Target Col 1": "[Source Table] Source Col", ... },
  "sourceData": [{ "[Source Table] Source Col": "value", ... }]
}

Response: Binary Excel file (.xlsx)

Document Parsing Logic

Table Detection

The parser handles two table formats:

Form-Style Tables: Tables with label-value pairs
- Detected by cells ending with ":"
- Extracts each label as a column name, next cell as value
- Example: | Test Case Type: | Funktionstest |
Data Tables: Traditional tabular data
- First non-title row treated as headers
- Subsequent rows are data records
- Example: Standard tables with header row

Metadata Extraction

Filename: Extracted from document header or paragraph containing ".docx"
Summary Text: Paragraphs following "Kurze Zusammenfassung" header
Free Text: Non-table content between sections

HTML Parsing

mammoth.js converts DOCX to HTML, then custom regex-based parsing extracts:

<table> elements with all nested <tr>, <td>, <th>
<p> paragraphs for free text
Cell content cleaned of HTML tags, normalized whitespace

Frontend State Management

Converter Page State

interface State {
  step: 'template' | 'upload' | 'map' | 'preview';
  file: File | null;
  isProcessing: boolean;
  parsedData: ParsedData | null;  // Source document data
  templateData: TemplateData | null;  // Target template structure
  mapping: Record<string, string>;  // target -> source column mapping
}

Data Flow

handleTemplateSelect → Parse template → Store templateData
handleFileSelect → Parse document → Store parsedData
ColumnMapper → User maps columns → Update mapping
handleDownload → Generate Excel with getCombinedData() + mapping

Combined Data Generation

The getCombinedData() function merges data from all tables:

Main table (most rows) provides base records
Other tables' first row data is added to each record
All columns prefixed with [Table Name] for unique identification

Styling System

Design Tokens (CSS Variables)

Defined in client/src/index.css:

Colors: Primary blue (221 83% 53%), slate backgrounds
Typography: Inter (body), Space Grotesk (headings)
Spacing: 0.25rem base unit
Border radius: 0.5rem default

Component Styling

shadcn/ui components for consistent UI
Tailwind utilities for layout and spacing
Framer Motion for animations and transitions

Dependencies

Production

mammoth: DOCX to HTML conversion
xlsx: Excel file reading/writing
multer: File upload handling
react-dropzone: Drag-and-drop file uploads
framer-motion: Animations
@radix-ui/*: Accessible UI primitives

Development

vite: Build tool and dev server
tsx: TypeScript execution
tailwindcss: CSS framework
typescript: Type checking

Running the Application

# Development
npm run dev

# Production build
npm run build
npm start

The application runs on port 5000 (configurable via PORT env variable).

Future Improvements

Database Integration: Store templates and conversion history using Drizzle ORM
Template Management: Save/load custom Excel templates
Batch Processing: Convert multiple documents at once
Advanced Mapping: Rule-based transformations and data validation
Authentication: User accounts for saved preferences

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
attached_assets		attached_assets
client		client
script		script
server		server
shared		shared
.gitignore		.gitignore
.replit		.replit
README.md		README.md
Readme_eng.md		Readme_eng.md
Readme_ger.md		Readme_ger.md
components.json		components.json
drizzle.config.ts		drizzle.config.ts
package-lock.json		package-lock.json
package.json		package.json
postcss.config.js		postcss.config.js
replit.md		replit.md
tech_readme.md		tech_readme.md
tsconfig.json		tsconfig.json
vite-plugin-meta-images.ts		vite-plugin-meta-images.ts
vite.config.ts		vite.config.ts

stdreher/New_DocXLS-Converter

Folders and files

Latest commit

History

Repository files navigation

Doc2Xls Studio - Technical Documentation

Project Overview

Architecture

Stack

Directory Structure

Core Features

1. Template Upload (Step 1)

2. Document Upload (Step 2) - Batch Processing

3. Column Mapping (Step 3)

4. Preview & Export (Step 4)

API Endpoints

POST /api/parse-template

POST /api/parse-document

POST /api/parse-documents-batch

POST /api/generate-excel

Document Parsing Logic

Table Detection

Metadata Extraction

HTML Parsing

Frontend State Management

Converter Page State

Data Flow

Combined Data Generation

Styling System

Design Tokens (CSS Variables)

Component Styling

Dependencies

Production

Development

Running the Application

Future Improvements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

POST `/api/parse-template`

POST `/api/parse-document`

POST `/api/parse-documents-batch`

POST `/api/generate-excel`

Packages