This post is a guide to deploy a Hadoop cluster by minikube.

macOS

Prerequisite

1
$ brew install qemu socket_vmnet helm kubectl

Start Minikube

1
2
$ minikube start --driver qemu --network socket_vmnet \
    --cpus "$(($(nproc) / 2))" --memory "$(nproc)g"

Ubuntu 24.04

Prerequisite

1
2
3
4
5
6
# https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user
$ sudo groupadd docker

$ sudo usermod -aG docker $USER

$ newgrp docker

Start Minikube

1
2
$ minikube start --driver docker \
    --cpus "$(($(nproc) / 2))" --memory "$(nproc)g"

Install Hadoop Cluster by Helm

1
2
3
4
5
6
7
$ git clone https://github.com/adonis0147/helm-hadoop

$ cd helm-hadoop

$ bash docker/build_image.sh

$ helm install --name-template hadoop .

Check the Status of Hadoop Cluster

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
$ kubectl get pods

NAME   READY   STATUS    RESTARTS        AGE     IP            NODE       NOMINATED NODE   READINESS GATES
dn-0   1/1     Running   0               5m14s   10.244.0.21   minikube   <none>           <none>
dn-1   1/1     Running   0               5m9s    10.244.0.27   minikube   <none>           <none>
dn-2   1/1     Running   0               5m5s    10.244.0.30   minikube   <none>           <none>
hs-0   1/1     Running   0               5m13s   10.244.0.22   minikube   <none>           <none>
jn-0   1/1     Running   0               5m14s   10.244.0.24   minikube   <none>           <none>
jn-1   1/1     Running   0               5m7s    10.244.0.29   minikube   <none>           <none>
jn-2   1/1     Running   0               5m1s    10.244.0.31   minikube   <none>           <none>
nm-0   1/1     Running   0               5m14s   10.244.0.20   minikube   <none>           <none>
nm-1   1/1     Running   0               5m9s    10.244.0.25   minikube   <none>           <none>
nm-2   1/1     Running   0               5m6s    10.244.0.28   minikube   <none>           <none>
nn-0   1/1     Running   0               5m14s   10.244.0.19   minikube   <none>           <none>
nn-1   1/1     Running   1 (4m38s ago)   5m9s    10.244.0.26   minikube   <none>           <none>
rm-0   1/1     Running   0               5m14s   10.244.0.23   minikube   <none>           <none>

Hadoop Cluster

  • Namenode: 2
  • Journalnode: 3
  • Datanode: 3
  • Resourcemanager: 1
  • Nodemanager: 3
  • Historyserver: 1

Access the services

macOS

1
2
3
4
5
6
# Set route up
$ sudo route -n delete 10.244.0.0/16
$ sudo route -n add 10.244.0.0/16 "$(minikube ip)"

# Don't kill this process
$ minikube tunnel

Ubuntu 24.04

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
# Set DNS up
$ interface="$(netstat -nr | grep "$(minikube ip | sed -n 's/\(.*\)\..*/\1.0/p')" |
    awk '{print $NF}')"
$ sudo resolvectl dns "${interface}" \
    "$(kubectl get -n kube-system service --no-headers | awk '{print $3}')"
$ sudo resolvectl domain "${interface}" cluster.local

# Set route up
$ sudo route del -net 10.244.0.0 netmask 255.255.0.0
$ sudo route add -net 10.244.0.0 netmask 255.255.0.0 gw "$(minikube ip)"

# Don't kill this process
$ minikube tunnel

Test

Ping

1
2
3
4
5
$ ping nn-0.namenode.default.svc.cluster.local

PING nn-0.namenode.default.svc.cluster.local (10.244.0.19) 56(84) bytes of data.
64 bytes from nn-0.namenode.default.svc.cluster.local (10.244.0.19): icmp_seq=1 ttl=63 time=0.069 ms
64 bytes from nn-0.namenode.default.svc.cluster.local (10.244.0.19): icmp_seq=2 ttl=63 time=0.079 ms

Access HDFS

1
2
3
4
$ kubectl exec -it nn-0 -- hadoop fs -ls /

Found 1 items
drwxrwx---   - root supergroup          0 2025-05-05 11:10 /tmp

MapReduce Wordcount

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
$ kubectl exec -it rm-0 -- bash -c 'for i in {0..999}; do echo ${i} >>numbers; done'

$ kubectl exec -it rm-0 -- hadoop fs -put numbers /numbers

$ kubectl exec -it rm-0 -- hadoop jar \
    hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar wordcount \
    /numbers /output

2025-05-05 11:29:30,858 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at rm-0.resourcemanager.default.svc.cluster.local/10.244.0.23:8032
2025-05-05 11:29:31,682 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1746443418771_0001
2025-05-05 11:29:32,242 INFO input.FileInputFormat: Total input files to process : 1
2025-05-05 11:29:32,492 INFO mapreduce.JobSubmitter: number of splits:1
2025-05-05 11:29:32,711 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1746443418771_0001
2025-05-05 11:29:32,712 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-05-05 11:29:33,054 INFO conf.Configuration: resource-types.xml not found
2025-05-05 11:29:33,055 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2025-05-05 11:29:33,723 INFO impl.YarnClientImpl: Submitted application application_1746443418771_0001
2025-05-05 11:29:33,854 INFO mapreduce.Job: The url to track the job: http://rm-0.resourcemanager.default.svc.cluster.local:8088/proxy/application_1746443418771_0001/
2025-05-05 11:29:33,856 INFO mapreduce.Job: Running job: job_1746443418771_0001
2025-05-05 11:29:45,261 INFO mapreduce.Job: Job job_1746443418771_0001 running in uber mode : false
2025-05-05 11:29:45,263 INFO mapreduce.Job:  map 0% reduce 0%
2025-05-05 11:29:51,384 INFO mapreduce.Job:  map 100% reduce 0%
2025-05-05 11:30:00,470 INFO mapreduce.Job:  map 100% reduce 100%
2025-05-05 11:30:00,494 INFO mapreduce.Job: Job job_1746443418771_0001 completed successfully
2025-05-05 11:30:00,638 INFO mapreduce.Job: Counters: 54
...

$ kubectl exec -it rm-0 -- hadoop fs -ls /output

Found 2 items
-rw-r--r--   3 root supergroup          0 2025-05-05 11:29 /output/_SUCCESS
-rw-r--r--   3 root supergroup       5890 2025-05-05 11:29 /output/part-r-00000

Spark SparkPi

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
# Download Apache Spark
$ curl -LO 'https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz'
$ tar -zxvf spark-3.5.5-bin-hadoop3.tgz

# Set Hadoop client up
$ cd spark-3.5.5-bin-hadoop3
$ SPARK_HOME="$(pwd)"

$ mkdir -p conf/hadoop
$ kubectl get configmaps hdfs-conf -o jsonpath="{.data.core-site\.xml}" \
    >conf/hadoop/core-site.xml
$ kubectl get configmaps hdfs-conf -o jsonpath="{.data.hdfs-site\.xml}" \
    >conf/hadoop/hdfs-site.xml
$ kubectl get configmaps yarn-conf -o jsonpath="{.data.yarn-site\.xml}" \
    >conf/hadoop/yarn-site.xml

# Set Spark up
$ cat >conf/spark-defaults.conf <<EOF
spark.master                  yarn
spark.submit.deployMode       cluster

spark.eventLog.enabled        true
spark.eventLog.dir            hdfs:///tmp/spark-events
spark.history.fs.logDirectory hdfs:///tmp/spark-events
EOF

$ cat >conf/spark-env.sh <<EOF
HADOOP_CONF_DIR="${SPARK_HOME}/conf/hadoop"
YARN_CONF_DIR="${SPARK_HOME}/conf/hadoop"
HADOOP_USER_NAME='root'
EOF

# Run SparkPi
$ kubectl exec -it nn-0 -- hadoop fs -mkdir -p /tmp/spark-events

$ ./bin/run-example org.apache.spark.examples.SparkPi 10000

25/05/16 16:34:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/05/16 16:34:09 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at rm-0.resourcemanager.default.svc.cluster.local/10.244.0.4:8032
25/05/16 16:34:10 INFO Configuration: resource-types.xml not found
25/05/16 16:34:10 INFO ResourceUtils: Unable to find 'resource-types.xml'.
25/05/16 16:34:10 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
25/05/16 16:34:10 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
25/05/16 16:34:10 INFO Client: Setting up container launch context for our AM
25/05/16 16:34:10 INFO Client: Setting up the launch environment for our AM container
25/05/16 16:34:10 INFO Client: Preparing resources for our AM container
25/05/16 16:34:10 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
25/05/16 16:34:13 INFO Client: Uploading resource file:/private/var/folders/pq/ffp2x74d7gs2_y_z5v7q4dn00000gp/T/spark-5e05a1af-feba-4e50-a0ec-e05c0be847a7/__spark_libs__18048069634091976738.zip -> hdfs://hadoop-cluster/user/root/.sparkStaging/application_1747383152432_0004/__spark_libs__18048069634091976738.zip
25/05/16 16:34:14 INFO Client: Uploading resource file:/spark-3.5.5-bin-hadoop3/examples/jars/spark-examples_2.12-3.5.5.jar -> hdfs://hadoop-cluster/user/root/.sparkStaging/application_1747383152432_0004/spark-examples_2.12-3.5.5.jar
25/05/16 16:34:14 INFO Client: Uploading resource file:/spark-3.5.5-bin-hadoop3/examples/jars/scopt_2.12-3.7.1.jar -> hdfs://hadoop-cluster/user/root/.sparkStaging/application_1747383152432_0004/scopt_2.12-3.7.1.jar
25/05/16 16:34:14 WARN Client: Same name resource file:///spark-3.5.5-bin-hadoop3/examples/jars/spark-examples_2.12-3.5.5.jar added multiple times to distributed cache
25/05/16 16:34:14 INFO Client: Uploading resource file:/private/var/folders/pq/ffp2x74d7gs2_y_z5v7q4dn00000gp/T/spark-5e05a1af-feba-4e50-a0ec-e05c0be847a7/__spark_conf__8161596058161130912.zip -> hdfs://hadoop-cluster/user/root/.sparkStaging/application_1747383152432_0004/__spark_conf__.zip
25/05/16 16:34:14 INFO SecurityManager: Changing view acls to: adonis,root
25/05/16 16:34:14 INFO SecurityManager: Changing modify acls to: adonis,root
25/05/16 16:34:14 INFO SecurityManager: Changing view acls groups to:
25/05/16 16:34:14 INFO SecurityManager: Changing modify acls groups to:
25/05/16 16:34:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: adonis, root; groups with view permissions: EMPTY; users with modify permissions: adonis, root; groups with modify permissions: EMPTY
25/05/16 16:34:14 INFO Client: Submitting application application_1747383152432_0004 to ResourceManager
25/05/16 16:34:14 INFO YarnClientImpl: Submitted application application_1747383152432_0004
25/05/16 16:34:15 INFO Client: Application report for application_1747383152432_0004 (state: ACCEPTED)
...

# Check logs on historyserver
Log Type: stdout
Log Upload Time: Fri May 16 08:34:39 +0000 2025
Log Length: 33
Pi is roughly 3.1416564911416565

Reference