Web Server Log Analysis using Hadoop

Oct 2, 2025 · 3 min read

Welcome to this comprehensive guide on setting up a big data pipeline for web server log analysis using Apache Hadoop, Flume, Spark, and Hive, with visualization in Jupyter Notebook.

This tutorial covers installation, configuration, data ingestion, processing, querying, and visualization of web server logs (e.g., NASA HTTP access logs) to extract insights like:

  • Top URLs
  • Top IP addresses
  • Status code distribution
  • Daily website hits

We’ll work on Linux (Arch or Ubuntu) and run Hadoop in pseudo-distributed mode for simplicity, with the following versions:

  • Hadoop 3.3.6
  • Flume 1.11.0
  • Spark 3.5.7
  • Hive 3.1.3

Pro Tip: You could spin up a virtual machine with everything pre-installed and ready to go. But where’s the fun in that? 😉 Nothing beats rolling up your sleeves, installing piece by piece, and watching the magic happen.

📌 Prerequisites

  • Linux machine (Arch or Ubuntu)
  • Basic terminal and bash familiarity
  • At least 10GB disk space
  • Internet access for downloads
  • Python 3 + pip installed (for Jupyter)

1. Installing and Setting Up Hadoop

Hadoop provides distributed storage (HDFS) and distributed processing (YARN and MapReduce). We'll configure it in pseudo-distributed mode, where all daemons run on a single machine.

Step 1: Install Java JDK 8

# Ubuntu
sudo apt install openjdk-8-jdk

# Arch
sudo pacman -S jdk8-openjdk

Verify:

java -version

Step 2: Configure Environment Variables

Open ~/.bashrc and paste the following:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # On Arch, use /usr/lib/jvm/java-8-openjdk
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=~/bigdata/hadoop-3.3.6/
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export PDSH_RCMD_TYPE=ssh

Apply changes:

source ~/.bashrc

Step 3: Install SSH and Download Hadoop

# Ubuntu
sudo apt-get install ssh

# Arch
sudo pacman -S openssh

Download and extract Hadoop:

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz -P ~/Downloads
mkdir -p ~/bigdata
tar -zxvf ~/Downloads/hadoop-3.3.6.tar.gz -C ~/bigdata

Edit hadoop-env.sh:

cd ~/bigdata/hadoop-3.3.6/etc/hadoop
nano hadoop-env.sh

# Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Step 4: Configure Hadoop XML Files

Edit the following XML files in ~/bigdata/hadoop-3.3.6/etc/hadoop/. Hadoop does not expand shell variables inside XML, so replace $USER in the snippets below with your actual username:

core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Proxy users for multi-component access -->
  <property>
    <name>hadoop.proxyuser.$USER.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.$USER.groups</name>
    <value>*</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/$USER/bigdata/hadoop-3.3.6/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/$USER/bigdata/hadoop-3.3.6/data/datanode</value>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Step 5: Setup SSH Keys

Why SSH? Hadoop's start scripts log into every node over SSH, even when the only node is your own machine. So even in pseudo-distributed mode, passwordless SSH to localhost is your friend.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost   # Test SSH

Step 6: Format HDFS and Start Hadoop

export PDSH_RCMD_TYPE=ssh
~/bigdata/hadoop-3.3.6/bin/hdfs namenode -format
start-all.sh    # To start Hadoop daemons
stop-all.sh     # To stop when needed

Verify via HDFS UI: http://localhost:9870

Create a user directory:

hdfs dfs -mkdir -p /user/$USER

2. Installing and Configuring Apache Flume

Flume ingests log files into HDFS.

Step 1: Download Flume

wget https://downloads.apache.org/flume/1.11.0/apache-flume-1.11.0-bin.tar.gz -P ~/bigdata
tar -xzf ~/bigdata/apache-flume-1.11.0-bin.tar.gz -C ~/bigdata

Step 2: Set Environment Variables

Add to ~/.bashrc:

export FLUME_HOME=~/bigdata/apache-flume-1.11.0-bin
export PATH=$PATH:$FLUME_HOME/bin
source ~/.bashrc

Step 3: Configure Flume

Create ~/bigdata/config/flume.conf (run mkdir -p ~/bigdata/config first if the directory doesn't exist). As with the Hadoop XML files, replace $USER with your actual username, since Flume won't expand shell variables in its properties file:

agent.sources = log-source
agent.channels = memory-channel
agent.sinks = hdfs-sink

# Source: Monitor log directory
agent.sources.log-source.type = spooldir
agent.sources.log-source.spoolDir = /home/$USER/bigdata/logs
agent.sources.log-source.fileHeader = true

# Channel: In-memory
agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity = 1000

# Sink: Write to HDFS
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/user/$USER/logs
agent.sinks.hdfs-sink.hdfs.filePrefix = web-log
agent.sinks.hdfs-sink.hdfs.rollInterval = 300
agent.sinks.hdfs-sink.hdfs.rollSize = 10485760
agent.sinks.hdfs-sink.hdfs.rollCount = 0

# Connect components
agent.sources.log-source.channels = memory-channel
agent.sinks.hdfs-sink.channel = memory-channel

Step 4: Prepare HDFS Directory

hdfs dfs -mkdir -p /user/$USER/logs

3. Ingesting Data with Flume

Step 1: Download Sample Logs

We'll use NASA access logs:

wget ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz -P ~/bigdata/logs
gunzip ~/bigdata/logs/NASA_access_log_Jul95.gz
mv ~/bigdata/logs/NASA_access_log_Jul95 ~/bigdata/logs/access.log

Step 2: Run Flume Agent

Loosen HDFS permissions (acceptable for a local sandbox) and start the agent:

hdfs dfs -chmod -R 777 /
flume-ng agent --conf ~/bigdata/config --conf-file ~/bigdata/config/flume.conf --name agent -Dflume.root.logger=INFO,console

Verify ingestion:

hdfs dfs -ls /user/$USER/logs
hdfs dfs -cat /user/$USER/logs/web-log.* | head

You should see log entries like:

199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
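Each entry follows the Common Log Format: client host, identity, user, timestamp, request line, status code, and response size. Purely as a sanity check (this snippet is not part of the pipeline), you can parse a line in plain Python to see which fields the Spark job will extract later:

import re

# Common Log Format: host ident authuser [timestamp] "request" status bytes
LOG_PATTERN = re.compile(r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
match = LOG_PATTERN.match(line)
if match:
    host, ident, user, timestamp, request, status, size = match.groups()
    print(host, timestamp, request.split()[1], status, size)
    # 199.72.81.55 01/Jul/1995:00:00:01 -0400 /history/apollo/ 200 6245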

4. Installing and Configuring Apache Spark

Spark processes the ingested logs.

Step 1: Download Spark

wget https://downloads.apache.org/spark/spark-3.5.7/spark-3.5.7-bin-hadoop3.tgz -P ~/bigdata
tar -xzf ~/bigdata/spark-3.5.7-bin-hadoop3.tgz -C ~/bigdata

Step 2: Set Environment Variables

Add to ~/.bashrc:

export SPARK_HOME=~/bigdata/spark-3.5.7-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin
source ~/.bashrc

Step 3: Configure Spark

Create ~/bigdata/spark-3.5.7-bin-hadoop3/conf/spark-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64  # Adjust for Arch if needed
export HADOOP_HOME=~/bigdata/hadoop-3.3.6
export SPARK_MASTER_HOST=localhost
export SPARK_MASTER_PORT=7077

Verify:

spark-submit --version

Step 4: Create Spark Processing Script

Create ~/bigdata/log_analysis.py.

Get it here.

This script parses the logs, computes key metrics, and saves them as Parquet files.
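If you'd rather write the script yourself, here is a minimal PySpark sketch of what it needs to do. It is an assumption-laden stand-in for the linked script, not a copy of it: it accepts the --input and --output arguments used by the spark-submit command below, and writes one Parquet directory per metric (top_urls, top_ips, status_counts, daily_hits) with column names that match the Hive tables defined in section 5. Adjust the regex and date handling to your needs:

import argparse

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Common Log Format regex (illustrative; groups: ip, timestamp, method, url, status, size)
LOG_PATTERN = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)'

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    spark = SparkSession.builder.appName("WebLogAnalysis").getOrCreate()

    # Read raw log lines and pull out the fields we care about
    raw = spark.read.text(args.input)
    logs = raw.select(
        F.regexp_extract("value", LOG_PATTERN, 1).alias("ip"),
        F.regexp_extract("value", LOG_PATTERN, 2).alias("timestamp"),
        F.regexp_extract("value", LOG_PATTERN, 4).alias("url"),
        F.regexp_extract("value", LOG_PATTERN, 5).alias("status"),
    ).filter(F.col("ip") != "")

    # Derive a DATE column from timestamps like 01/Jul/1995:00:00:01 -0400
    logs = logs.withColumn("date", F.to_date(F.substring("timestamp", 1, 11), "dd/MMM/yyyy"))

    # Key metrics, each written as Parquet for the Hive external tables
    logs.groupBy("url").count().orderBy(F.desc("count")) \
        .write.mode("overwrite").parquet(args.output + "/top_urls")
    logs.groupBy("ip").count().orderBy(F.desc("count")) \
        .write.mode("overwrite").parquet(args.output + "/top_ips")
    logs.groupBy("status").count() \
        .write.mode("overwrite").parquet(args.output + "/status_counts")
    logs.groupBy("date").count().orderBy("date") \
        .write.mode("overwrite").parquet(args.output + "/daily_hits")

    spark.stop()

if __name__ == "__main__":
    main()

The groupBy(...).count() calls produce a count column of type BIGINT, which is why the external tables in section 5 declare count BIGINT.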

Step 5: Run Spark Job

spark-submit --master local[*] ~/bigdata/log_analysis.py --input hdfs://localhost:9000/user/$USER/logs/* --output hdfs://localhost:9000/user/$USER/logs_output_full

Verify:

hdfs dfs -ls /user/$USER/logs_output_full

5. Installing and Configuring Hive

Hive allows SQL-like querying on the processed data.

Step 1: Download Hive

wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz -P ~/bigdata
tar -xzf ~/bigdata/apache-hive-3.1.3-bin.tar.gz -C ~/bigdata

Step 2: Set Environment Variables

Add to ~/.bashrc:

export HIVE_HOME=~/bigdata/apache-hive-3.1.3-bin
export PATH=$PATH:$HIVE_HOME/bin
source ~/.bashrc

Step 3: Configure Hive

Create ~/bigdata/apache-hive-3.1.3-bin/conf/hive-site.xml (again, replace $USER with your username):

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:derby:;databaseName=/home/$USER/bigdata/apache-hive-3.1.3-bin/metastore_db;create=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>hdfs://localhost:9000/user/hive/warehouse</value>
    </property>
</configuration>

Step 4: Initialize Hive

hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive
schematool -initSchema -dbType derby

Step 5: Start Hive and Query

Start server:

hiveserver2 &

Connect with Beeline:

beeline -u jdbc:hive2://localhost:10000

Create the database and external tables (replace $USER in the LOCATION paths with your username; Beeline won't expand it for you):

CREATE DATABASE log_analysis;
USE log_analysis;

CREATE EXTERNAL TABLE top_urls (url STRING, count BIGINT)
STORED AS PARQUET
LOCATION 'hdfs://localhost:9000/user/$USER/logs_output_full/top_urls';

CREATE EXTERNAL TABLE top_ips (ip STRING, count BIGINT)
STORED AS PARQUET
LOCATION 'hdfs://localhost:9000/user/$USER/logs_output_full/top_ips';

CREATE EXTERNAL TABLE status_counts (status STRING, count BIGINT)
STORED AS PARQUET
LOCATION 'hdfs://localhost:9000/user/$USER/logs_output_full/status_counts';

CREATE EXTERNAL TABLE daily_hits (`date` DATE, count BIGINT)
STORED AS PARQUET
LOCATION 'hdfs://localhost:9000/user/$USER/logs_output_full/daily_hits';

Query example (top 10 URLs):

SELECT url, count FROM top_urls ORDER BY count DESC LIMIT 10;

Export to CSV:

beeline -u jdbc:hive2://localhost:10000 --outputformat=csv2 -e \
"USE log_analysis; SELECT url, count FROM top_urls ORDER BY count DESC LIMIT 10;" > ~/bigdata/top_urls.csv

Repeat for the other tables (top_ips, status_counts, daily_hits), or script the exports as sketched below.
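If you'd rather not run Beeline by hand four times, here is an optional little Python helper that shells out to the same beeline command for each table. The queries and the ~/bigdata/*.csv layout are assumptions chosen to match the visualization script in the next section:

import os
import subprocess

# One export query per result table
QUERIES = {
    "top_urls": "SELECT url, count FROM top_urls ORDER BY count DESC LIMIT 10",
    "top_ips": "SELECT ip, count FROM top_ips ORDER BY count DESC LIMIT 10",
    "status_counts": "SELECT status, count FROM status_counts",
    "daily_hits": "SELECT `date`, count FROM daily_hits ORDER BY `date`",
}
OUT_DIR = os.path.expanduser("~/bigdata")

for table, query in QUERIES.items():
    out_path = os.path.join(OUT_DIR, f"{table}.csv")
    with open(out_path, "w") as out:
        # Same beeline invocation as above, redirected to a CSV file
        subprocess.run(
            ["beeline", "-u", "jdbc:hive2://localhost:10000",
             "--outputformat=csv2", "-e", f"USE log_analysis; {query};"],
            stdout=out, check=True,
        )
    print(f"wrote {out_path}")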

6. Visualizing the Results

To make insights more accessible, visualize the queried data using Python and Matplotlib. Export the CSVs as above, then create a visualization script.

Step 1: Install Required Libraries (if needed)

Ensure Pandas and Matplotlib are available; if they aren't, install them with:

pip install pandas matplotlib

Step 2: Create Visualization Script

Create ~/bigdata/visualize_logs.py:

import os

import pandas as pd
import matplotlib.pyplot as plt

BASE = os.path.expanduser('~/bigdata')  # matplotlib does not expand '~', so resolve it once

# Load CSVs (skip the header row that beeline's csv2 format writes)
top_urls = pd.read_csv(os.path.join(BASE, 'top_urls.csv'), skiprows=1, names=['url', 'count'])
status_counts = pd.read_csv(os.path.join(BASE, 'status_counts.csv'), skiprows=1, names=['status', 'count'])
daily_hits = pd.read_csv(os.path.join(BASE, 'daily_hits.csv'), skiprows=1, names=['date', 'count'])

# Plot Top URLs (Bar Chart)
plt.figure(figsize=(10, 6))
plt.barh(top_urls['url'][:10], top_urls['count'][:10])
plt.xlabel('Count')
plt.ylabel('URL')
plt.title('Top 10 URLs by Hits')
plt.gca().invert_yaxis()
plt.savefig(os.path.join(BASE, 'top_urls.png'))
plt.show()

# Plot Status Counts (Pie Chart)
plt.figure(figsize=(8, 8))
plt.pie(status_counts['count'], labels=status_counts['status'], autopct='%1.1f%%')
plt.title('HTTP Status Code Distribution')
plt.savefig(os.path.join(BASE, 'status_counts.png'))
plt.show()

# Plot Daily Hits (Line Chart)
daily_hits['date'] = pd.to_datetime(daily_hits['date'])
daily_hits = daily_hits.sort_values('date')
plt.figure(figsize=(12, 6))
plt.plot(daily_hits['date'], daily_hits['count'])
plt.xlabel('Date')
plt.ylabel('Hits')
plt.title('Daily Website Hits')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig(os.path.join(BASE, 'daily_hits.png'))
plt.show()

Step 3: Run the Script

python3 ~/bigdata/visualize_logs.py

This generates PNG files for each visualization, which you can embed in reports or dashboards. For more advanced visuals, consider a BI tool like Tableau, or load the Parquet outputs into a Jupyter notebook with PySpark and plot them there.
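If you go the notebook route, here is a minimal sketch, assuming pyspark is pip-installed in the same environment as Jupyter (pip install pyspark) and that the Spark job wrote its output to the path used in section 4. It reads the top_urls Parquet directory straight from HDFS and plots it with pandas and Matplotlib:

import getpass

import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LogViz").getOrCreate()
user = getpass.getuser()

# Read the Parquet output written by the Spark job directly from HDFS
top_urls = (
    spark.read.parquet(f"hdfs://localhost:9000/user/{user}/logs_output_full/top_urls")
    .orderBy("count", ascending=False)
    .limit(10)
    .toPandas()
)

top_urls.plot.barh(x="url", y="count", figsize=(10, 6), legend=False, title="Top 10 URLs by Hits")
plt.gca().invert_yaxis()
plt.xlabel("Count")
plt.tight_layout()
plt.show()

The same pattern works for the other three outputs; swap the directory name and the plot type.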

This completes the pipeline! You've now ingested, processed, queried, and visualized web logs using a robust big data ecosystem. If you encounter issues, check logs in $HADOOP_HOME/logs or Flume console output. Happy analyzing!
