This project collects data from the river basin water level reports published on the official website of Sri Lanka's Disaster Management Center (DMC). It is inspired by Nuwan Senarathne's original work. While that work is tremendous, I felt the need to provide a clean CSV dataset for analysts and climate experts, since the original data comes as PDF reports, which can be tedious and daunting to extract from manually.
- URL - DMC
- Format - PDF
As this is a rapid prototype, the initial version requires you to source the data from the DMC's website yourself; in other words, a manual download of the reports is required. Refer to the Future Work section below to learn what's next for this project.
- Move the downloaded PDF reports to the `Data/PDF` folder.
- In the project root directory, run `python3 bulk_process.py` using your terminal of choice. (The `bulk_process.py` script uses the `extract_pdf.py` and `parse_json.py` scripts to ingest data into the DuckDB database.)
- `extract_pdf.py` scans for all the PDFs in the `Data/PDF` folder, extracts the tables within each PDF, converts them to JSON files, and stores them in the `Data/JSON` folder. For this, the `camelot` library is used, which is excellent at extracting PDF tables.
- Next, `parse_json.py` extracts the contents, cleans them, and inserts them into the DuckDB database.
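The pipeline above can be sketched in miniature. The helpers below are purely illustrative (the actual file-naming and scanning logic lives in `extract_pdf.py` and may differ); they only show how each PDF under `Data/PDF` maps to a JSON file under `Data/JSON`:

```python
from pathlib import Path

# Illustrative sketch only: mirror a PDF under Data/PDF to a JSON file
# under Data/JSON, as the extract_pdf.py step described above does.
# The real naming scheme may differ.
def json_path_for(pdf_path: Path, json_dir: Path = Path("Data/JSON")) -> Path:
    return json_dir / pdf_path.with_suffix(".json").name

def pending_pdfs(pdf_dir: Path = Path("Data/PDF")) -> list[Path]:
    # Scan for every PDF dropped into the input folder.
    return sorted(pdf_dir.glob("*.pdf"))

print(json_path_for(Path("Data/PDF/water_level_report_20251130.pdf")))
# Data/JSON/water_level_report_20251130.json (on POSIX-style paths)
```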
Derived Attributes
The original PDF report contains dynamically changing columns such as:
- "Water Level at xxxx"
- "X HR RF in mm at xxxx"
Therefore, the following derived columns are introduced:
- last hour reported water level
- last hour water level difference
- rainfall in mm
- rainfall hour interval
Through these derived attributes, all the incident report data becomes:
- Normalized to one uniform measure
- Centralized for better analytical reporting (for example, we can derive metrics like "average hourly rainfall in mm")
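For instance, once every row carries `rainfall in mm` and `rainfall hour interval`, a metric like average hourly rainfall falls out directly. The rows below are made-up illustration data, not real readings:

```python
# Made-up rows using the derived attributes described above.
rows = [
    {"rainfall_mm": 12.0, "rainfall_hour_interval": 3},  # 4.0 mm/hour
    {"rainfall_mm": 5.0, "rainfall_hour_interval": 1},   # 5.0 mm/hour
]

def avg_hourly_rainfall(rows: list[dict]) -> float:
    # Normalize each reading to mm/hour, then average.
    hourly = [r["rainfall_mm"] / r["rainfall_hour_interval"] for r in rows]
    return sum(hourly) / len(hourly)

print(avg_hourly_rainfall(rows))  # 4.5
```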
If you are familiar with SQL, you can query the database directly using a database manager like DBeaver.
- Get incidents for a specific date:

```sql
SELECT *
FROM flood_db.incidents.incidents_report
WHERE report_date = '20251130';
```

- Get incidents for a date range:

```sql
SELECT *
FROM flood_db.incidents.incidents_report
WHERE report_date BETWEEN '20251130' AND '20251206';
```

- Get incidents of a given gauging station on a given date:

```sql
SELECT *
FROM flood_db.incidents.incidents_report
WHERE report_date = '20251130'
  AND gauging_station = 'Nagalagam Street';
```

If you need an API endpoint to integrate this data into your own application or project, that is also possible. A GET endpoint is enabled for retrieving data through query parameters for river, river basin, dates, and gauging station.
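As a quick sketch of how a request URL for that endpoint could be assembled (the path and parameter names here are assumptions based on the description above; check the Swagger UI for the actual ones):

```python
from urllib.parse import urlencode

# Assumed endpoint path and parameter names; verify against the API docs.
BASE_URL = "http://0.0.0.0:9999/water-level/"

def build_query_url(base_url: str, **params: str) -> str:
    # Drop empty parameters and URL-encode the rest.
    return base_url + "?" + urlencode({k: v for k, v in params.items() if v})

url = build_query_url(BASE_URL, date="20251130", gauging_station="Nagalagam Street")
print(url)  # http://0.0.0.0:9999/water-level/?date=20251130&gauging_station=Nagalagam+Street
```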
Steps
- First, you need to have Docker installed on your machine.
- Next, go to the `image/` directory.
- Build the Docker image from the available Dockerfile:

```shell
docker build -t flood-api:latest .
```

- Now run the image:

```shell
docker run -p 9999:9999 flood-api:latest
```

  To run in detached mode:

```shell
docker run -d -p 9999:9999 flood-api:latest
```

- In your web browser, head to http://localhost:9999/docs to interact with the API responses using the Swagger UI.
```python
import requests

base_url = "http://0.0.0.0:9999/water-level/"

def main(base_url: str, params_i: dict):
    res = requests.get(base_url, params=params_i)
    if res.ok:
        print(res.status_code)
        return res.json()

if __name__ == "__main__":
    parameters = {
        "date": "20251130",
        "river": "Gurugoda Oya"
    }
    res = main(base_url, parameters)
    print(res)
```

Future Work

Currently the focus is on providing a clean dataset for analytical purposes, which at the base level it excels at. However, to make the project more comprehensive and easily accessible to everyone, the following future work is planned:
- Extracting PDF report data through IO bytes rather than storing files in local storage.
- Storing the raw JSON-converted report data (before deriving further attributes) in MongoDB documents for persistence.
- A Click command group for batch backfilling and targeted backfilling of data.
- A Plotly Dash app for dynamically filtering data through a UI and downloading it in CSV format, giving access to everyone without specific technical knowledge.
- Deploying to a DigitalOcean droplet and enabling orchestration for productionization.