This project is a full-stack application (built using Groovy, Grails, Docker, MySQL, Groq, TypeScript and Svelte) that extracts company incorporations published in the BORME (Boletín Oficial del Registro Mercantil) from the official website of the Boletín Oficial del Estado (BOE).
Given a specific date, the system downloads the corresponding BORME PDFs, extracts the incorporation entries, transforms each one into structured JSON with a language model, and persists the results in the database.
The official BORME publications are available at https://www.boe.es/borme/dias/YYYY/MM/DD.
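A date can be mapped onto that URL pattern with a small helper. This is a minimal sketch (the function name and zero-padding are assumptions based on the `YYYY/MM/DD` pattern above):

```typescript
// Build the BORME publications URL for a given date (YYYY/MM/DD path).
function bormeUrl(date: Date): string {
  const yyyy = date.getFullYear().toString();
  const mm = (date.getMonth() + 1).toString().padStart(2, "0"); // months are 0-based
  const dd = date.getDate().toString().padStart(2, "0");
  return `https://www.boe.es/borme/dias/${yyyy}/${mm}/${dd}`;
}

console.log(bormeUrl(new Date(2025, 7, 25))); // → https://www.boe.es/borme/dias/2025/08/25
```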
| Method | Path | Description |
|---|---|---|
| GET | `/constitutions?date=<date>` | Returns company incorporations for the given date. If entries already exist in the database, they are returned immediately; otherwise, the system processes, persists and returns them. |
| GET | `/constitutions/<identifier>` | Returns detailed information for a specific company incorporation, including structured JSON and links to the original PDF. |
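A client might call the first endpoint like this. This is a sketch, not project code: the ISO date format and the helper names are assumptions; only the base path and query parameter come from the table above:

```typescript
// Base URL matches the local development setup described below.
const API_BASE = "http://localhost:8080/api";

// Build the incorporations query URL; the ISO date format is an assumption.
function constitutionsUrl(date: string): string {
  return `${API_BASE}/constitutions?date=${encodeURIComponent(date)}`;
}

// Fetch and decode the JSON response (Node 18+ has a global fetch).
async function fetchConstitutions(date: string): Promise<unknown> {
  const response = await fetch(constitutionsUrl(date));
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response.json();
}

console.log(constitutionsUrl("2025-08-25")); // → http://localhost:8080/api/constitutions?date=2025-08-25
```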
Install Node.js 22.20.0 or higher:
```shell
sudo apt update
sudo apt install -y curl
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
```
- Have Java Development Kit (JDK) 11 installed on your machine
- Have Docker installed on your machine
- Have a Docker Hub account
Clone the repository:
```shell
git clone https://github.com/antonioalanxs/scraping-BORME
cd scraping-BORME
```
Copy .env.example to .env and configure your environment:
```
BORME_URI=https://www.boe.es/borme/dias
LLM_URI=llm_uri_placeholder
LLM_API_KEY=llm_api_key_placeholder
LLM_MODEL=llm_model_placeholder
LLM_MAXIMUM_THREAD_REQUESTS=llm_maximum_thread_requests_placeholder_if_needed
DATABASE_NAME=database_name_placeholder
DATABASE_USER=database_user_placeholder
DATABASE_PASSWORD=database_password_placeholder
DATABASE_ROOT_USER=database_root_user_placeholder
DATABASE_ROOT_PASSWORD=database_root_password_placeholder
CORS_URL_PATTERN=cors_url_pattern_placeholder
CORS_ALLOWED_ORIGINS=cors_allowed_origins_placeholder
CORS_ALLOWED_METHODS=cors_allowed_methods_placeholder
CORS_ALLOWED_HEADERS=cors_allowed_headers_placeholder
CORS_ALLOW_CREDENTIALS=cors_allow_credentials_placeholder
VITE_API_URI=http://localhost:8080/api
```
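A misconfigured `.env` usually surfaces late, as a runtime failure. A minimal fail-fast check could look like the sketch below; the choice of which variables are "required" is an assumption drawn from the template above, not from the project's actual startup code:

```typescript
// Variables assumed to be mandatory, based on the .env template above.
const REQUIRED = [
  "BORME_URI",
  "LLM_URI",
  "LLM_API_KEY",
  "LLM_MODEL",
  "DATABASE_NAME",
  "DATABASE_USER",
  "DATABASE_PASSWORD",
] as const;

// Return the names of required variables that are unset or empty.
function missingVariables(env: Record<string, string | undefined>): string[] {
  return REQUIRED.filter((key) => !env[key]);
}

console.log(missingVariables({ BORME_URI: "https://www.boe.es/borme/dias" }));
```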
Go to the Docker development directory and start the database service using the up.sh script:

```shell
cd scraping-BORME/docker/development
./up.sh
```
Backend - navigate to the backend directory, build and run the server:

```shell
cd ../../backend
./gradlew assemble
./gradlew bootRun
```
The backend API will be available at http://localhost:8080/api.
Frontend - navigate to the frontend directory, install dependencies and run the development server:

```shell
cd ../frontend
npm install
npm run dev
```
The frontend interface will be available at http://localhost:5173.
The project currently uses a Language Model API via Groq to transform raw legal text from BORME PDFs into structured JSON, performing tasks such as extracting incorporation details, normalizing dates and names, and converting unstructured text into a format suitable for database storage and API consumption.
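The request sent to the language model might be shaped roughly as follows. This is a sketch under two assumptions: Groq exposes an OpenAI-compatible chat-completions API (it does, but this project's exact prompt and model are not shown here), and the prompt wording and model name below are illustrative placeholders:

```typescript
// Build an OpenAI-compatible chat-completions body that asks the model
// to return structured JSON for one BORME incorporation entry.
// The prompt text and model name are illustrative, not the project's own.
function buildExtractionRequest(model: string, rawEntry: string) {
  return {
    model,
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "Extract the company incorporation described in the BORME entry as JSON " +
          "with entity, start_of_operations, social_object, address, capital and " +
          "registry fields. Normalize dates to ISO 8601 and names to a consistent form.",
      },
      { role: "user", content: rawEntry },
    ],
  };
}

const body = buildExtractionRequest("model_placeholder", "389826 - MEDITERRAMOVING SL. Constitución. ...");
console.log(body.messages.length); // → 2
```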
Example input:
```text
389826 - MEDITERRAMOVING SL.
Constitución. Comienzo de operaciones: 9.07.25. Objeto social: Transporte de mercancías, mudanzas, almacenajes de
mercancías. Guardamuebles, logística. Servicio de embalaje. Alquiler de vehículos con y sin conductor. Importación y
exportación de mercancías. Todo tipo de actividades relacionadas con trabajos de albañilería, reformas y mantenimiento
de comunidades. Domicilio: PTDA DE ALZABARES BAJO 46 (ELCHE). Capital: 3.000,00 Euros. Declaración de unipersonalidad.
Socio único: VIDAL RICO RICARDO. Nombramientos. Adm. Unico: VIDAL RICO RICARDO. Datos registrales. S 8 , H A 199888,
I/A 1 (25.08.25).
```
Example output:
```json
{
  "entity": {
    "code": "389826",
    "name": "MEDITERRAMOVING, S. L."
  },
  "start_of_operations": "2025-07-09",
  "social_object": "Transporte, logística y almacenaje de mercancías, mudanzas, alquiler de vehículos, embalaje y trabajos de albañilería, reformas y mantenimiento.",
  "address": "PTDA DE ALZABARES BAJO 46 (ELCHE)",
  "capital": 3000,
  "single_person_declaration": true,
  "sole_partner": "Vidal Rico, Ricardo",
  "appointment": {
    "sole_administrator": "Vidal Rico, Ricardo"
  },
  "registry": {
    "section": "8",
    "page": "199888",
    "entry": "1",
    "date": "2025-08-25"
  }
}
```
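On the consuming side, the structured output can be given a type. The interface below is a possible shape inferred from the example JSON, not a type definition taken from the project:

```typescript
// Possible shape of one structured incorporation, mirroring the example output.
interface Constitution {
  entity: { code: string; name: string };
  start_of_operations: string; // ISO 8601 date
  social_object: string;
  address: string;
  capital: number; // euros
  single_person_declaration?: boolean;
  sole_partner?: string;
  appointment?: { sole_administrator?: string };
  registry: { section: string; page: string; entry: string; date: string };
}

// Sample value matching the example above (social_object shortened here).
const example: Constitution = {
  entity: { code: "389826", name: "MEDITERRAMOVING, S. L." },
  start_of_operations: "2025-07-09",
  social_object: "Transporte de mercancías, mudanzas y logística.",
  address: "PTDA DE ALZABARES BAJO 46 (ELCHE)",
  capital: 3000,
  single_person_declaration: true,
  sole_partner: "Vidal Rico, Ricardo",
  appointment: { sole_administrator: "Vidal Rico, Ricardo" },
  registry: { section: "8", page: "199888", entry: "1", date: "2025-08-25" },
};

console.log(example.entity.code); // → 389826
```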
Originally, the project was designed to run a local language model via Ollama inside Docker. Due to resource constraints, it now uses the Groq API. Free-tier limitations may restrict the number of requests per execution thread, so some entries may be skipped.
Example backend log showing skipped entries due to request limits:
```text
[com.example.LanguageModelService] Target '389831 - BOLUCAMVA SL. Constitución. Comienzo de operaciones: 14.07.25. Objeto social: 1. Actividades de gestión de fondos propios con exclusión de fondos ajenos. 2. Actividades de apoyo a la agricultura. 3. Alquiler de bienes inmobiliarios por cuenta propia. 4. Silvicultura y otras actividades forestales. Domicilio: C/ L'ESPART 5 (ALTEA). Capital: 3.000,00 Euros. Declaración de unipersonalidad. Socio único: CAMPOMANES EGUIGUREN LUIS. Nombramientos. Adm. Unico: CAMPOMANES EGUIGUREN LUIS. Datos registrales. S 8 , H A 199939, I/A 1 (25.08.25). ' omitted because the maximum requests per execution has been reached for thread 'qtp99823907-48'
[com.example.LanguageModelService] Target '389835 - TRIUNFO IBERICO SL. Constitución. Comienzo de operaciones: 30.07.25. Objeto social: Restaurantes y puestos de comidas. Domicilio: C/ HERNAN CORTES 3 (SAN VICENTE DEL RASPEIG). Capital: 3.000,00 Euros. Nombramientos. Adm. Unico: LILLO MARTINEZ VICENTE. Datos registrales. S 8 , H A 199944, I/A 1 (25.08.25). ' omitted because the maximum requests per execution has been reached for thread 'qtp99823907-48'
...
[com.example.LanguageModelServiceRequestCleanerFilter] Cleared ThreadLocal for request '/api/constitutions'
```
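The behavior those logs describe amounts to a per-execution request budget: once the configured maximum is reached, remaining entries are skipped instead of sent. A minimal sketch of that mechanism (class and method names are illustrative, not the project's Groovy implementation):

```typescript
// Per-execution request budget: acquire until the maximum is reached,
// then reject, so remaining entries are skipped rather than sent to the API.
class RequestBudget {
  private used = 0;
  constructor(private readonly maximum: number) {}

  tryAcquire(): boolean {
    if (this.used >= this.maximum) return false;
    this.used += 1;
    return true;
  }
}

const budget = new RequestBudget(2);
const targets = ["entry-1", "entry-2", "entry-3"];
// Entries that fail to acquire a slot are the ones the logs report as omitted.
const skipped = targets.filter((target) => !budget.tryAcquire());
console.log(skipped); // → [ 'entry-3' ]
```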
At present, the endpoint GET /constitutions?date=<date> handles the entire workflow — downloading PDFs, extracting text, calling the language model multiple times and persisting results — within a single HTTP request. While this approach works for demonstration purposes, it is not optimized for production environments, especially when processing large numbers of incorporations.
Ideally, this workflow would be offloaded to an asynchronous background job. In such a setup, the API endpoint would first check the database and immediately return existing results. If data for the requested date is not available, a background job would be triggered to perform the PDF downloads, parsing, LLM processing and database persistence. The client could then poll the API or subscribe to updates until the results are ready.
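The client side of that design reduces to a polling loop against a job-status endpoint. The sketch below simulates it with an injected status function standing in for the hypothetical HTTP call; no such endpoint exists in the current version:

```typescript
// Hypothetical job states for the background-processing design described above.
type JobStatus = "pending" | "running" | "done";

// Poll until the job reports "done" or the attempt limit is exhausted.
// statusOf stands in for an HTTP call to a (hypothetical) status endpoint.
function pollUntilDone(statusOf: () => JobStatus, maxAttempts: number): boolean {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (statusOf() === "done") return true;
  }
  return false;
}

// Simulate a job that finishes on the third poll.
const statuses: JobStatus[] = ["pending", "running", "done"];
let i = 0;
console.log(pollUntilDone(() => statuses[Math.min(i++, statuses.length - 1)], 5)); // → true
```

In a real deployment each attempt would sleep between requests, or the client would subscribe to server-sent events instead of polling.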
This improvement has not been implemented in the current version due to time constraints and the scope of this project. Nevertheless, the codebase has been structured so that integrating background jobs in the future would be straightforward.
If you would like to contribute to the project, please follow these steps:
1. Create a feature branch (`git checkout -b feature/new-feature`)
2. Commit your changes (`git commit -m 'feat: add new feature'` or `fix: correct a bug`)
3. Push the branch (`git push origin feature/new-feature`)

This project is licensed under the Apache License 2.0.