2024 A Knowledge Query and Retrieval System for Architectural Regulations Based on RAG :

Module 1 - HTML Fetcher

1-1. Load HTML content.

1-2. Match CHS-ENG translations.

1-3. Analyze and output the term list.

Module 2 - PDF Extractor

2-1. Load PDF document.

2-2. Perform paragraph analysis based on the term library.

2-3. Extract keywords and output JSON data.

Module 3 - JSON Processor

3-1. Load JSON data.

3-2. Tabulate the data.

3-3. Store in SQL database.

Module 4 - AI Helper

4-1. User inputs a question.

4-2. Select historical key terms or chat records.

4-3. The AI queries the database and generates an answer.

Module 5 - Chat Logger

5-1. Record user question-answer content.

5-2. Store in JSON and edit it locally in the WPF app.

5-3. Upload to PostgreSQL database for future extraction or fine-tuning.

- Summary -

Current AI document assistants typically provide only outlines or answers to broad knowledge queries.
They lack the ability to deliver precise answers specific to a professional field or to an uploaded document.
This system enhances retrieval accuracy by targeting professional terms and phrases gathered via Module 1 (HTML Fetcher) from websites like Archdaily, OED, and HKBD.
Module 2 (PDF Extractor) then uses the collected terms to analyze building regulation PDFs and organize them into structured JSON data.
The analyzed JSON data is then stored in an SQL database via Module 3 (JSON Processor).
Module 4 (AI Helper) uses this data together with commercial models to perform systematic Q&A tasks.
Lastly, Module 5 (Chat Logger) saves interaction history into local JSON files or an online PostgreSQL database, enabling later extraction for message review or fine-tuning.

- Detailed Explanation of the Technical Scheme -

The system integrates five core modules that work together :

1. HTML Fetcher (Professional Term Collection Module)
. Function : Scrapes professional terms and phrases from targeted websites (e.g., Archdaily, OED, HKBD) following specified rules to build a domain-specific glossary.
. Output : JSON data of professional terms for subsequent modules.
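
As a rough illustration of the HTML Fetcher's term collection, the sketch below pairs English and Chinese terms from a small HTML table and emits them as JSON. The table markup, the `TermTableParser` class, and the `eng`/`chs` field names are assumptions made for this example; the real pages and scraping rules are not described here, and the two sample term pairs are placeholders rather than scraped data.

```python
import json
from html.parser import HTMLParser


class TermTableParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only text inside a <td> belongs to a term.
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


# Hypothetical page fragment: one English term and one Chinese term per row.
sample_html = """
<table>
  <tr><td>plot ratio</td><td>容积率</td></tr>
  <tr><td>fire compartment</td><td>防火分区</td></tr>
</table>
"""

parser = TermTableParser()
parser.feed(sample_html)
glossary = [{"eng": row[0], "chs": row[1]} for row in parser.rows if len(row) == 2]
print(json.dumps(glossary, ensure_ascii=False, indent=2))
```

Rows that do not contain exactly two cells are skipped, so partially filled rows would not enter the glossary in this sketch.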

2. PDF Extractor (Document Parsing Module)
. Function : Extracts core semantic information from user-uploaded PDFs (e.g., building regulation documents) based on the glossary.
. Output : JSON-structured analysis results.
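
The glossary-based paragraph analysis can be sketched as a simple term match over paragraphs that have already been extracted from the PDF. This is a minimal sketch, not the system's actual parser: the function name `extract_keywords`, the record fields, and the sample paragraphs are all hypothetical, PDF text extraction itself is left out because no library is named, and the whole-word matching shown here only suits English terms.

```python
import json
import re


def extract_keywords(paragraphs, glossary):
    """Return one record per paragraph that contains at least one glossary term."""
    # Whole-word, case-insensitive patterns; suitable for English terms only.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in glossary
    }
    records = []
    for index, text in enumerate(paragraphs):
        found = [term for term in glossary if patterns[term].search(text)]
        if found:
            records.append({"paragraph": index, "text": text, "keywords": found})
    return records


# Placeholder glossary and paragraphs, not taken from any real regulation.
glossary = ["fire resistance rating", "means of escape", "plot ratio"]
paragraphs = [
    "Every storey shall be provided with adequate means of escape.",
    "This paragraph contains no glossary term.",
    "The plot ratio and the fire resistance rating are set out in the schedule.",
]

records = extract_keywords(paragraphs, glossary)
print(json.dumps(records, indent=2))
```

Paragraphs without any glossary term are dropped, which keeps the JSON output limited to text that later retrieval can actually be keyed on.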

3. JSON Processor (Data Storage Module)
. Function : Converts parsed JSON data into normalized tables stored in an SQL database for quick retrieval and further analysis.
. Output : A structured SQL database.
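
One way to picture the conversion from parsed JSON to normalized tables is shown below. It is a sketch under stated assumptions: SQLite from the Python standard library stands in for the SQL database, and the table names (`paragraph`, `keyword`, `paragraph_keyword`), the `store_records` helper, and the sample JSON are invented for illustration.

```python
import json
import sqlite3


def store_records(conn, records):
    """Store paragraphs and their keywords in three normalized tables."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS paragraph (
            id INTEGER PRIMARY KEY,
            text TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS keyword (
            id INTEGER PRIMARY KEY,
            term TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS paragraph_keyword (
            paragraph_id INTEGER NOT NULL REFERENCES paragraph(id),
            keyword_id INTEGER NOT NULL REFERENCES keyword(id),
            PRIMARY KEY (paragraph_id, keyword_id)
        );
    """)
    for record in records:
        cur = conn.execute(
            "INSERT INTO paragraph (text) VALUES (?)", (record["text"],)
        )
        paragraph_id = cur.lastrowid
        for term in record["keywords"]:
            # Each term is stored once; repeated terms reuse the existing row.
            conn.execute("INSERT OR IGNORE INTO keyword (term) VALUES (?)", (term,))
            keyword_id = conn.execute(
                "SELECT id FROM keyword WHERE term = ?", (term,)
            ).fetchone()[0]
            conn.execute(
                "INSERT OR IGNORE INTO paragraph_keyword VALUES (?, ?)",
                (paragraph_id, keyword_id),
            )
    conn.commit()


# Placeholder JSON in the shape a parser might emit (hypothetical fields).
json_data = '''[
  {"text": "Every storey shall be provided with adequate means of escape.",
   "keywords": ["means of escape"]},
  {"text": "Means of escape shall be kept free of obstruction.",
   "keywords": ["means of escape"]}
]'''

conn = sqlite3.connect(":memory:")
store_records(conn, json.loads(json_data))
keyword_count = conn.execute("SELECT COUNT(*) FROM keyword").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM paragraph_keyword").fetchone()[0]
print(keyword_count, link_count)  # prints "1 2"
```

Keeping terms in their own table means a keyword shared by many paragraphs is stored once and joined through the link table, which is what makes keyword-driven lookup cheap.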

4. AI Helper (Intelligent Q&A Module)
. Function : Provides precise and domain-specific answers using stored data and commercial models.
. Output : Interactive Q&A results.
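
The retrieval step behind the Q&A can be sketched as follows: look up stored paragraphs whose keywords occur in the user's question, then assemble them into a prompt for the model. Everything named here is an assumption for illustration (the single `regulation` table, the `retrieve_context` and `build_prompt` helpers, the prompt wording, and the placeholder excerpts), and the call to a commercial model is deliberately omitted because no provider or API is specified.

```python
import sqlite3


def retrieve_context(conn, question, limit=3):
    """Return stored paragraphs whose keyword appears in the question."""
    rows = conn.execute(
        "SELECT term, text FROM regulation ORDER BY rowid"
    ).fetchall()
    lowered = question.lower()
    hits = [text for term, text in rows if term.lower() in lowered]
    return hits[:limit]


def build_prompt(question, context):
    """Combine retrieved excerpts and the question into one prompt string."""
    lines = ["Answer using only the regulation excerpts below.", ""]
    lines += [f"[{i}] {text}" for i, text in enumerate(context, start=1)]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Placeholder data; the excerpts are invented, not quoted from a regulation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE regulation (term TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO regulation VALUES (?, ?)",
    [
        ("means of escape",
         "Every storey shall be provided with adequate means of escape."),
        ("plot ratio",
         "The plot ratio shall not exceed the value in the schedule."),
    ],
)

question = "What is required for means of escape on each storey?"
context = retrieve_context(conn, question)
prompt = build_prompt(question, context)
print(prompt)
```

Only the excerpt keyed on "means of escape" reaches the prompt, so the model is steered toward the relevant clause instead of answering from general knowledge.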

5. Chat Logger (Interaction Recording Module)
. Function : Logs interaction history to local JSON files or a PostgreSQL database for later retrieval or model fine-tuning.
. Output : Stored dialogue data as JSON files or database tables.
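
The local JSON side of the logging can be sketched as an append-and-rewrite of a single file. This is a minimal sketch with hypothetical names: `log_exchange`, the `chat_log.json` file name, and the record fields are assumptions, the answers are placeholders, and the upload to PostgreSQL is not shown because it would depend on a database driver that is not named here.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def log_exchange(path, question, answer):
    """Append one question-answer record to a JSON file and return all records."""
    records = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
    records.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
    })
    # ensure_ascii=False keeps Chinese terms readable in the saved file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records


log_path = os.path.join(tempfile.mkdtemp(), "chat_log.json")
log_exchange(log_path, "What is a plot ratio?", "(model answer placeholder)")
history = log_exchange(log_path, "Which clause covers it?", "(model answer placeholder)")
print(len(history))  # prints 2
```

Because the file holds a plain JSON array, it can be edited by hand or by a desktop tool before any later upload or fine-tuning export.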

- Comparison with Existing Technologies -

Limitations of Current Technologies :

Rely on broad corpus pretraining, with limited specialization for specific fields.
Lack deep document analysis capabilities for user-uploaded materials.
Handle historical interaction data inefficiently, which limits iterative model improvement.

Advantages of the Tool :

Enhanced Accuracy : Professional glossary creation using HTML Fetcher improves term relevance and retrieval precision.
In-depth Analysis : PDF Extractor extracts and organizes vital semantic data into an SQL database.
Data Management : JSON Processor stores the parsed data in normalized tables, enabling efficient retrieval and keeping the database usable over the long term.
Systematic Q&A : Combines custom and commercial models for reliable domain-specific responses.
Historical Record Use : Logs and archives data for refinement and model improvement.

Conclusion :
The system targets professional-domain use and aims to outperform generic solutions by combining specialized data processing with systematic, glossary-driven Q&A.