This program reads AWS VPC flow logs and tags each log entry based on its destination port and protocol. It uses a lookup table to match (dstport, protocol) combinations to tags, then generates a summary report that includes:
- The number of occurrences of each tag.
- How often each port/protocol combination appears.
- The number of flow logs that couldn't be matched to any tag (untagged).
- The number of skipped or malformed flow log lines.
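As a rough illustration of that flow, the counting could be structured as below. This is a minimal sketch, not the program's actual code; the function name `summarize` and the exact shapes of `entries` and `lookup` are assumptions.

```python
from collections import Counter

def summarize(entries, lookup):
    """Tally tag hits, port/protocol combinations, untagged and skipped lines.

    `entries` yields parsed (dstport, protocol) pairs, or None for a
    malformed line; `lookup` maps (dstport, protocol) -> list of tags.
    """
    tag_counts, combo_counts = Counter(), Counter()
    untagged = skipped = 0
    for entry in entries:
        if entry is None:            # malformed line: count it and move on
            skipped += 1
            continue
        combo_counts[entry] += 1
        tags = lookup.get(entry)
        if tags:
            tag_counts.update(tags)  # one combination may carry several tags
        else:
            untagged += 1
    return tag_counts, combo_counts, untagged, skipped
```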
This program is designed to handle:
- Flow log files up to 10 MB in size.
- Lookup tables with up to 10,000 mappings.
- Multiple tags for a single port/protocol combination.
- Case-insensitive matching for both protocols and tags.
- Flow Log Format:
- The program only supports version 2 of AWS VPC flow logs, following the structure described in the AWS Flow Logs documentation.
- Only flow logs in the default format are supported.
- Lines with fewer than 8 fields are treated as malformed and skipped.
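A minimal sketch of that per-line check, assuming the AWS v2 default field order (where `dstport` is the 7th field and the IANA protocol number is the 8th); `parse_line` and `PROTOCOLS` are illustrative names, not the program's actual API:

```python
# IANA protocol numbers for the protocols the lookup table covers.
PROTOCOLS = {"6": "tcp", "17": "udp", "1": "icmp"}

def parse_line(line):
    """Return (dstport, protocol) for a version-2 record, or None if malformed."""
    fields = line.split()
    if len(fields) < 8:        # too short to contain dstport and protocol
        return None
    dstport = fields[6]        # 7th field of the v2 default format
    protocol = PROTOCOLS.get(fields[7], fields[7]).lower()
    return dstport, protocol
```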
- Lookup Table:
- The lookup table is a CSV file with three columns: `dstport`, `protocol`, `tag`.
- Each port/protocol combination can map to multiple tags.
- The lookup table uses lowercase protocols (`tcp`, `udp`, `icmp`), but matching is case-insensitive.
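A minimal loading sketch along those lines, assuming the CSV begins with a `dstport,protocol,tag` header row (`load_lookup` is an illustrative name):

```python
import csv

def load_lookup(path="lookup.csv"):
    """Map (dstport, protocol) -> list of tags, normalized to lowercase."""
    lookup = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["dstport"].strip(), row["protocol"].strip().lower())
            # One combination may map to several tags; tags are stored
            # lowercase so counting is case-insensitive.
            lookup.setdefault(key, []).append(row["tag"].strip().lower())
    return lookup
```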
- Case Insensitivity:
- Both protocol matching and tag lookup are case-insensitive. Tags are stored and counted in lowercase to avoid any mismatches.
- Performance:
- The program reads the flow logs line by line, making it memory efficient and capable of handling large files up to 10 MB.
- The lookup table is stored in memory using a dictionary, which efficiently handles up to 10,000 mappings.
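For instance, iterating over the file object streams one line at a time rather than loading the whole file into memory (a sketch; `parse_line` is the hypothetical helper from the parsing sketch above):

```python
with open("flow_logs.txt") as logs:
    for line in logs:              # one line in memory at a time
        entry = parse_line(line)   # hypothetical per-line parser (see above)
        # ...update the counters here...
```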
- Python 3.x: No external libraries or dependencies are required. The program uses built-in Python modules, so it can be run directly on any system with Python installed.
- Clone or Download the Repository:
- Clone the repository from GitHub or download the source files:
```bash
git clone https://github.com/MarkedMuichiro/Flow_Log_Parsing_Program.git
cd Flow_Log_Parsing_Program
```
- Prepare Input Files:
- Ensure the following input files are present in the working directory:
- `flow_logs.txt`: A plain-text file containing the flow logs (up to 10 MB).
- `lookup.csv`: A CSV file with three columns: `dstport`, `protocol`, `tag`.
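For reference, a purely illustrative `lookup.csv` (these ports and tag names are hypothetical examples, not values the program expects; note that one combination may appear with several tags):

```csv
dstport,protocol,tag
25,tcp,sv_p1
443,tcp,sv_p2
443,tcp,web
68,udp,dhcp
```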
- Run the Program:
- Run the program using Python from the command line:
```bash
python flow_log_parser.py
```
The program generates an output file named output.txt, which contains the results of the flow log processing. This file is located in the same directory as the script.
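An output excerpt might look like the following; the layout and values here are purely illustrative, not the program's guaranteed format:

```text
Tag Counts:
Tag,Count
sv_p1,2
sv_p2,1

Port/Protocol Combination Counts:
Port,Protocol,Count
25,tcp,2
443,tcp,1

Untagged: 3
Skipped/Malformed: 1
```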
- Valid Flow Logs:
The program was tested with flow logs containing various destination ports and protocols (TCP, UDP, ICMP) to ensure proper matching and counting of tags.
- Malformed Flow Logs:
Lines with fewer than 8 fields were tested to ensure they are properly skipped, and the count of malformed lines is correctly reported in the output.
- Multiple Tags per Port/Protocol:
The lookup table was tested with multiple tags assigned to the same port and protocol combination. The program correctly counted each tag when the matching port/protocol combination appeared in the flow logs.
- Case Insensitive Matching:
Flow logs and lookup table entries with varying cases (e.g., `TCP` vs `tcp`) were tested to ensure the program handled case insensitivity correctly, matching regardless of the case used.
- Large Files:
The program was tested with a flow log file of 10 MB in size and a lookup table containing 10,000 entries to confirm it could process large amounts of data efficiently without exceeding memory limits.
- Efficiency:
The program processes the flow logs line by line, making it scalable for large files. Memory usage is kept minimal since only the lookup table is stored in memory as a dictionary, allowing efficient lookups (average O(1) complexity).
- Error Handling:
The program gracefully handles malformed flow log lines, skipping them and providing a count of how many lines were skipped. This ensures robustness when working with potentially messy log files.
- Extensibility:
The program is easily extendable. If additional protocols need to be supported, or if more complex tag mappings are required, the lookup structure and matching logic can be adapted with minimal changes, as the hypothetical sketch below illustrates.
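For example, supporting additional protocols could be as small as extending the protocol-number map used during parsing (numbers per the IANA registry; `PROTOCOLS` is the illustrative map from the parsing sketch above):

```python
PROTOCOLS.update({
    "47": "gre",    # Generic Routing Encapsulation
    "50": "esp",    # IPsec Encapsulating Security Payload
    "132": "sctp",  # Stream Control Transmission Protocol
})
```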