Web Server Log Analysis using Hadoop

Welcome to this comprehensive guide on setting up a big data pipeline for web server log analysis using Apache Hadoop, Flume, Spark, and Hive, with visualization in Jupyter Notebook.
This tutorial covers installation, configuration, data ingestion, processing, querying, and visualization of web server logs (e.g., NASA HTTP access logs) to extract insights like:
- Top URLs
- Top IP addresses
- Status code distribution
- Daily website hits
We’ll work on Linux (Arch or Ubuntu) and run everything in pseudo-distributed mode for simplicity, with the following versions:
- Hadoop 3.3.6
- Flume 1.11.0
- Spark 3.5.7
- Hive 3.1.3
Pro Tip: You could spin up a virtual machine with everything pre-installed and ready to go. But where’s the fun in that? 😉 Nothing beats rolling up your sleeves, installing piece by piece, and watching the magic happen.
📌 Prerequisites
- Linux machine (Arch or Ubuntu)
- Basic terminal and bash familiarity
- At least 10GB disk space
- Internet access for downloads
- Python 3 + pip installed (for Jupyter)
1. Installing and Setting Up Hadoop
Hadoop provides distributed storage (HDFS) and processing. We'll configure it in pseudo-distributed mode.
Step 1: Install Java JDK 8
# Ubuntu
sudo apt install openjdk-8-jdk
# Arch
sudo pacman -S jdk8-openjdk
Verify:
java -version
Step 2: Configure Environment Variables
Open ~/.bashrc and paste the following:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 # On Arch, typically /usr/lib/jvm/java-8-openjdk
export PATH=$PATH:/usr/lib/jvm/java-8-openjdk-amd64/bin
export HADOOP_HOME=~/bigdata/hadoop-3.3.6/
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export PDSH_RCMD_TYPE=ssh
Apply changes:
source ~/.bashrc
Step 3: Install SSH and Download Hadoop
# Ubuntu
sudo apt-get install ssh
# Arch
sudo pacman -S openssh
Download and extract Hadoop:
mkdir -p ~/bigdata
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz -P ~/Downloads
tar -zxvf ~/Downloads/hadoop-3.3.6.tar.gz -C ~/bigdata
Edit hadoop-env.sh:
cd ~/bigdata/hadoop-3.3.6/etc/hadoop
nano hadoop-env.sh
# Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Step 4: Configure Hadoop XML Files
Edit the following XML files in ~/bigdata/hadoop-3.3.6/etc/hadoop/. Note: these files are not processed by the shell, so replace $USER with your actual Linux username wherever it appears below.
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<!-- Proxy users for multi-component access -->
<property>
<name>hadoop.proxyuser.$USER.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.$USER.groups</name>
<value>*</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/$USER/bigdata/hadoop-3.3.6/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/$USER/bigdata/hadoop-3.3.6/data/datanode</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Step 5: Setup SSH Keys
Why SSH? Hadoop's start/stop scripts use SSH to talk to the nodes they manage. Even in pseudo-distributed mode they connect to localhost, so passwordless SSH is your friend.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost # Test SSH
Step 6: Format HDFS and Start Hadoop
export PDSH_RCMD_TYPE=ssh
~/bigdata/hadoop-3.3.6/bin/hdfs namenode -format
start-all.sh # To start Hadoop daemons
stop-all.sh # To stop when needed
Verify via the NameNode web UI at http://localhost:9870, or run jps and check that NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager are listed.
Create a user directory:
hdfs dfs -mkdir -p /user/$USER
2. Installing and Configuring Apache Flume
Flume ingests log files into HDFS.
Step 1: Download Flume
wget https://downloads.apache.org/flume/1.11.0/apache-flume-1.11.0-bin.tar.gz -P ~/bigdata
tar -xzf ~/bigdata/apache-flume-1.11.0-bin.tar.gz -C ~/bigdata
Step 2: Set Environment Variables
Add to ~/.bashrc:
export FLUME_HOME=~/bigdata/apache-flume-1.11.0-bin
export PATH=$PATH:$FLUME_HOME/bin
source ~/.bashrc
Step 3: Configure Flume
Create ~/bigdata/config/flume.conf (as before, replace $USER with your username in the paths):
agent.sources = log-source
agent.channels = memory-channel
agent.sinks = hdfs-sink
# Source: Monitor log directory
agent.sources.log-source.type = spooldir
agent.sources.log-source.spoolDir = /home/$USER/bigdata/logs
agent.sources.log-source.fileHeader = true
# Channel: In-memory
agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity = 1000
# Sink: Write to HDFS
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/user/$USER/logs
agent.sinks.hdfs-sink.hdfs.filePrefix = web-log
# Write plain text (the default file type is SequenceFile)
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.writeFormat = Text
# Roll files every 5 minutes or 10 MB, never by event count
agent.sinks.hdfs-sink.hdfs.rollInterval = 300
agent.sinks.hdfs-sink.hdfs.rollSize = 10485760
agent.sinks.hdfs-sink.hdfs.rollCount = 0
# Connect components
agent.sources.log-source.channels = memory-channel
agent.sinks.hdfs-sink.channel = memory-channel
Step 4: Prepare HDFS Directory
hdfs dfs -mkdir -p /user/$USER/logs
3. Ingesting Data with Flume
Step 1: Download Sample Logs
We'll use NASA access logs:
wget ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz -P ~/bigdata/logs
gunzip ~/bigdata/logs/NASA_access_log_Jul95.gz
mv ~/bigdata/logs/NASA_access_log_Jul95 ~/bigdata/logs/access.log
Step 2: Run Flume Agent
Grant permissions and start:
hdfs dfs -chmod -R 777 / # wide open for this local sandbox; don't do this on a shared cluster
flume-ng agent --conf ~/bigdata/config --conf-file ~/bigdata/config/flume.conf --name agent -Dflume.root.logger=INFO,console
Verify ingestion:
hdfs dfs -ls /user/$USER/logs
hdfs dfs -cat /user/$USER/logs/web-log.* | head
You should see log entries like:
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
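Each entry follows the Common Log Format: client host, identity, user, timestamp, request line, status code, and response size. As a quick illustration (a hypothetical snippet, not necessarily the exact regex the later processing script uses), a line can be split into those fields like this:
import re
# Common Log Format: host ident authuser [timestamp] "request" status bytes
LOG_RE = re.compile(r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')
line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
host, ident, user, ts, request, status, size = LOG_RE.match(line).groups()
print(host, status, request.split()[1])  # 199.72.81.55 200 /history/apollo/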
4. Installing and Configuring Apache Spark
Spark processes the ingested logs.
Step 1: Download Spark
wget https://downloads.apache.org/spark/spark-3.5.7/spark-3.5.7-bin-hadoop3.tgz -P ~/bigdata
tar -xzf ~/bigdata/spark-3.5.7-bin-hadoop3.tgz -C ~/bigdata
Step 2: Set Environment Variables
Add to ~/.bashrc:
export SPARK_HOME=~/bigdata/spark-3.5.7-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin
source ~/.bashrc
Step 3: Configure Spark
Create ~/bigdata/spark-3.5.7-bin-hadoop3/conf/spark-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 # Adjust for Arch if needed
export HADOOP_HOME=~/bigdata/hadoop-3.3.6
export SPARK_MASTER_HOST=localhost
export SPARK_MASTER_PORT=7077
Verify:
spark-submit --version
Step 4: Create Spark Processing Script
Create ~/bigdata/log_analysis.py.
Get it here.
This script parses the logs, computes key metrics, and saves them as Parquet files.
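If you'd rather write it yourself, here's a minimal sketch of what such a script needs to do. It assumes the Common Log Format shown above, the --input/--output flags used by the spark-submit command in the next step, and the output sub-directories that the Hive tables later expect; the linked script may differ in its details.
# log_analysis.py -- minimal sketch (not the exact linked script)
import argparse
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, to_date, desc
# Common Log Format: host ident authuser [timestamp] "METHOD url PROTO" status bytes
LOG_RE = r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)[^"]*" (\d{3})'
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()
    spark = SparkSession.builder.appName("WebLogAnalysis").getOrCreate()
    raw = spark.read.text(args.input)
    logs = raw.select(
        regexp_extract("value", LOG_RE, 1).alias("ip"),
        regexp_extract("value", LOG_RE, 2).alias("ts"),
        regexp_extract("value", LOG_RE, 3).alias("url"),
        regexp_extract("value", LOG_RE, 4).alias("status"),
    ).filter("status != ''")  # drop lines the regex could not parse
    # Top URLs and top IPs by hit count
    logs.groupBy("url").count().orderBy(desc("count")) \
        .write.mode("overwrite").parquet(args.output + "/top_urls")
    logs.groupBy("ip").count().orderBy(desc("count")) \
        .write.mode("overwrite").parquet(args.output + "/top_ips")
    # HTTP status code distribution
    logs.groupBy("status").count() \
        .write.mode("overwrite").parquet(args.output + "/status_counts")
    # Daily hits; timestamps look like 01/Jul/1995:00:00:01 -0400
    daily = logs.select(to_date(regexp_extract("ts", r'^(\d{2}/\w{3}/\d{4})', 1),
                                "dd/MMM/yyyy").alias("date"))
    daily.groupBy("date").count().orderBy("date") \
        .write.mode("overwrite").parquet(args.output + "/daily_hits")
    spark.stop()
if __name__ == "__main__":
    main()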
Step 5: Run Spark Job
spark-submit --master local[*] ~/bigdata/log_analysis.py --input hdfs://localhost:9000/user/$USER/logs/* --output hdfs://localhost:9000/user/$USER/logs_output_full
Verify:
hdfs dfs -ls /user/$USER/logs_output_full
5. Installing and Configuring Hive
Hive allows SQL-like querying on the processed data.
Step 1: Download Hive
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz -P ~/bigdata
tar -xzf ~/bigdata/apache-hive-3.1.3-bin.tar.gz -C ~/bigdata
Step 2: Set Environment Variables
Add to ~/.bashrc:
export HIVE_HOME=~/bigdata/apache-hive-3.1.3-bin
export PATH=$PATH:$HIVE_HOME/bin
source ~/.bashrc
Step 3: Configure Hive
Create ~/bigdata/apache-hive-3.1.3-bin/conf/hive-site.xml (replace $USER with your username):
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/home/$USER/bigdata/apache-hive-3.1.3-bin/metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://localhost:9000/user/hive/warehouse</value>
</property>
</configuration>
Step 4: Initialize Hive
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive
schematool -initSchema -dbType derby
Step 5: Start Hive and Query
Start server:
hiveserver2 &
Connect with Beeline:
beeline -u jdbc:hive2://localhost:10000
Create the database and external tables (substitute your actual username for $USER in the LOCATION paths):
CREATE DATABASE log_analysis;
USE log_analysis;
CREATE EXTERNAL TABLE top_urls (url STRING, count BIGINT)
STORED AS PARQUET
LOCATION 'hdfs://localhost:9000/user/$USER/logs_output_full/top_urls';
CREATE EXTERNAL TABLE top_ips (ip STRING, count BIGINT)
STORED AS PARQUET
LOCATION 'hdfs://localhost:9000/user/$USER/logs_output_full/top_ips';
CREATE EXTERNAL TABLE status_counts (status STRING, count BIGINT)
STORED AS PARQUET
LOCATION 'hdfs://localhost:9000/user/$USER/logs_output_full/status_counts';
CREATE EXTERNAL TABLE daily_hits (`date` DATE, count BIGINT)
STORED AS PARQUET
LOCATION 'hdfs://localhost:9000/user/$USER/logs_output_full/daily_hits';
Query example (top 10 URLs):
SELECT url, count FROM top_urls ORDER BY count DESC LIMIT 10;
Export to CSV:
beeline -u jdbc:hive2://localhost:10000 --outputformat=csv2 -e \
"USE log_analysis; SELECT url, count FROM top_urls ORDER BY count DESC LIMIT 10;" > ~/bigdata/top_urls.csv
Repeat for the other tables (top_ips, status_counts, daily_hits). Note that csv2 output includes a header row, which the visualization script below relies on.
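If you'd rather skip the CSV round-trip, you can also query HiveServer2 straight from Python with the PyHive client, an optional extra dependency not used elsewhere in this guide (pip install 'pyhive[hive]'; it may also need SASL packages from your distro). A rough sketch, assuming HiveServer2 is running with its default authentication:
import pandas as pd
from pyhive import hive
# Connect to the running HiveServer2 instance and pull a result set into pandas
conn = hive.Connection(host="localhost", port=10000, database="log_analysis")
cursor = conn.cursor()
cursor.execute("SELECT url, count FROM top_urls ORDER BY count DESC LIMIT 10")
top_urls = pd.DataFrame(cursor.fetchall(), columns=["url", "count"])
print(top_urls)
conn.close()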
6. Visualizing the Results
To make insights more accessible, visualize the queried data using Python and Matplotlib. Export the CSVs as above, then create a visualization script.
Step 1: Install Required Libraries (if needed)
Ensure Pandas and Matplotlib are available (they ship with many Python environments); if not, install them with:
pip install pandas matplotlib
Step 2: Create Visualization Script
Create ~/bigdata/visualize_logs.py:
import os
import pandas as pd
import matplotlib.pyplot as plt
# Base directory for the exported CSVs and the generated PNGs
BASE = os.path.expanduser('~/bigdata')
# Load CSVs (beeline's csv2 output includes a header row with the column names)
top_urls = pd.read_csv(os.path.join(BASE, 'top_urls.csv'))
status_counts = pd.read_csv(os.path.join(BASE, 'status_counts.csv'))
daily_hits = pd.read_csv(os.path.join(BASE, 'daily_hits.csv'))
# Plot Top URLs (Bar Chart)
plt.figure(figsize=(10, 6))
plt.barh(top_urls['url'][:10], top_urls['count'][:10])
plt.xlabel('Count')
plt.ylabel('URL')
plt.title('Top 10 URLs by Hits')
plt.gca().invert_yaxis()
plt.savefig(os.path.join(BASE, 'top_urls.png'))  # savefig does not expand '~', so build an absolute path
plt.show()
# Plot Status Counts (Pie Chart)
plt.figure(figsize=(8, 8))
plt.pie(status_counts['count'], labels=status_counts['status'], autopct='%1.1f%%')
plt.title('HTTP Status Code Distribution')
plt.savefig(os.path.join(BASE, 'status_counts.png'))
plt.show()
# Plot Daily Hits (Line Chart)
daily_hits['date'] = pd.to_datetime(daily_hits['date'])
plt.figure(figsize=(12, 6))
plt.plot(daily_hits['date'], daily_hits['count'])
plt.xlabel('Date')
plt.ylabel('Hits')
plt.title('Daily Website Hits')
plt.xticks(rotation=45)
plt.savefig(os.path.join(BASE, 'daily_hits.png'))
plt.show()
Step 3: Run the Script
python3 ~/bigdata/visualize_logs.py
This generates PNG files for each visualization that you can embed in reports or dashboards. For more advanced visuals, consider integrating with tools like Tableau, or explore the data interactively in a Jupyter notebook with PySpark and pandas, as sketched below.
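For the Jupyter route, one option (a sketch, assuming a local pip install pyspark and the output paths used earlier) is to read the Parquet results directly and hand them to pandas for plotting:
import getpass
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LogViz").getOrCreate()
user = getpass.getuser()  # same username as in the HDFS paths above
top_urls = (spark.read
            .parquet(f"hdfs://localhost:9000/user/{user}/logs_output_full/top_urls")
            .orderBy("count", ascending=False)
            .limit(10)
            .toPandas())
# pandas plotting wraps Matplotlib, so the chart renders inline in the notebook
top_urls.plot.barh(x="url", y="count", title="Top 10 URLs by Hits")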
This completes the pipeline! You've now ingested, processed, queried, and visualized web logs using a robust big data ecosystem. If you encounter issues, check logs in $HADOOP_HOME/logs or Flume console output. Happy analyzing!