
Kafka

Apache Kafka aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. It delivers reliable, millisecond-level responses that support both customer-facing applications and the delivery of real-time data to downstream systems.

Miri Infotech is launching a product that configures and publishes Apache Kafka, a free, distributed, scalable, and highly available streaming platform, as a pre-configured, ready-to-launch AMI on Amazon EC2. The AMI is built on Ubuntu and also contains Hadoop and HBase.

Before going deeper, it is worth understanding what Kafka is actually good for.

Kafka is generally used for two classes of applications:

  1.  Building real-time streaming data pipelines that reliably get data between systems or applications
  2.  Building real-time streaming applications that transform or react to the streams of data

It is one of the most popular tools among developers around the world because it is easy to pick up and it exposes four core APIs, namely Producer, Consumer, Streams, and Connect (a minimal producer sketch follows the list below).

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
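
As a quick illustration of the Producer API, here is a minimal sketch using the Kafka Java client. The broker address localhost:9092 and the topic name test-topic are assumptions made for the example only.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish one record (key + value) to the assumed topic "test-topic"
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key-1", "hello kafka"));
        }
    }
}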

To understand how Kafka works, it helps to start with a few basic concepts:

  • Kafka runs as a cluster on one or more servers.
  • The Kafka cluster stores streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp.
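
To make the last point concrete, a consumer can read back each record's key, value, and timestamp. This is a minimal sketch using the Kafka Java client (the 0.10.x client bundled with this AMI); the group id example-group and topic test-topic are illustrative assumptions.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "example-group");            // illustrative consumer group
        props.put("auto.offset.reset", "earliest");        // read the topic from the start
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic"));
            ConsumerRecords<String, String> records = consumer.poll(1000); // poll(long) in the 0.10.x client
            for (ConsumerRecord<String, String> r : records) {
                // Every record carries a key, a value, and a timestamp
                System.out.println(r.key() + " | " + r.value() + " | " + r.timestamp());
            }
        }
    }
}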

Topics and Logs

  • Let’s first dive into the core abstraction Kafka provides for a stream of records: the topic.
  • A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
  • Kafka stores messages which come from arbitrarily many processes called "producers". The data can thereby be partitioned in different "partitions" within different "topics". Within a partition the messages are indexed and stored together with a timestamp.
  • Other processes called "consumers" can query messages from partitions. Kafka runs on a cluster of one or more servers and the partitions can be distributed across cluster nodes.
  • Apache Kafka efficiently processes real-time and streaming data when implemented along with Apache Storm, Apache HBase, and Apache Spark. Deployed as a cluster on multiple servers, Kafka handles its entire publish-and-subscribe messaging system with the help of four APIs, namely the producer API, consumer API, streams API, and connector API. Its ability to deliver massive streams of messages in a fault-tolerant fashion has made it a replacement for some conventional messaging systems such as JMS and AMQP.
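
Because records within a partition are indexed by offset, a consumer can also attach itself to a specific partition and replay it from a chosen position. The following is a minimal sketch using the 0.10.x Kafka Java client, again with an assumed broker address and topic name:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PartitionReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("enable.auto.commit", "false");         // manual assignment, no offset commits
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Attach directly to partition 0 of an assumed topic and replay it from offset 0
            TopicPartition partition = new TopicPartition("test-topic", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seekToBeginning(Collections.singletonList(partition));
            for (ConsumerRecord<String, String> r : consumer.poll(1000)) {
                System.out.println("offset=" + r.offset() + " value=" + r.value());
            }
        }
    }
}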

 


You can subscribe to Kafka as an AWS Marketplace product and launch an instance from the product's AMI using the Amazon EC2 launch wizard.

To launch an instance from the AWS Marketplace using the launch wizard:

  • Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
  • From the Amazon EC2 dashboard, choose Launch Instance.
  • On the Choose an Amazon Machine Image (AMI) page, choose the AWS Marketplace category on the left. Find a suitable AMI by browsing the categories, or using the search functionality. Choose Select to choose your product.
  • A dialog displays an overview of the product you've selected. You can view the pricing information, as well as any other information that the vendor has provided. When you're ready, choose Continue.
  • On the Choose an Instance Type page, select the hardware configuration and size of the instance to launch. When you're done, choose Next: Configure Instance Details.
  • On the next pages of the wizard, you can configure your instance, add storage, and add tags. For more information about the different options you can configure, see Launching an Instance. Choose Next until you reach the Configure Security Group page.
  • The wizard creates a new security group according to the vendor's specifications for the product. The security group may include rules that allow all IP addresses (0.0.0.0/0) access on SSH (port 22) on Linux or RDP (port 3389) on Windows. We recommend that you adjust these rules to allow only a specific address or range of addresses to access your instance over those ports.
  • When you are ready, choose Review and Launch.
  • On the Review Instance Launch page, check the details of the AMI from which you're about to launch the instance, as well as the other configuration details you set up in the wizard. When you're ready, choose Launch to select or create a key pair, and launch your instance.
  • Depending on the product you've subscribed to, the instance may take a few minutes or more to launch. You are first subscribed to the product before your instance can launch. If there are any problems with your credit card details, you will be asked to update your account details. When the launch confirmation page displays, choose View Instances to go to the Instances page.

About

Apache Kafka is a distributed commit log for fast, fault-tolerant communication between producers and consumers using message-based topics. It provides the messaging backbone for building a new generation of distributed applications capable of handling billions of events and millions of transactions.

Guidelines

  • Avoid cryptic abbreviations.
  • Clear code is preferable to comments. When possible make your naming so good you don't need comments. When that isn't possible comments should be thought of as mandatory, write them to be read.
  • Logging, configuration, and public APIs are our "UI". Make them pretty, consistent, and usable.
  • There is no hard maximum line length, but keep lines reasonable.
  • Don't be sloppy. Don't check in commented-out code: we use version control, so it is still there in the history. Don't leave TODOs or FIXMEs in the code if you can help it. Don't leave println statements in the code.
  • Don't duplicate code.
  • Kafka is system software, and certain things are appropriate in system software that are not appropriate elsewhere. Sockets, bytes, concurrency, and distribution are our core competency, which means we will have a more "from scratch" implementation of some of these things than would be appropriate for software elsewhere in the stack. This is because we need to be exceptionally good at these things. This does not excuse fiddly low-level code, but it does excuse spending a little extra time to make sure that our file system structures, networking code, and threading model are all done right for our application rather than just trying to glue together ill-fitting off-the-shelf pieces.

Limitations:

Configurations

  • The “disk” configuration value is denominated in MB. We recommend you set the configuration value log_retention_bytes to a value smaller than the indicated “disk” configuration.
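
For illustration only (the numbers are hypothetical): if “disk” is set to 10240 MB, that corresponds to 10240 * 1024 * 1024 = 10,737,418,240 bytes, so log_retention_bytes should be set to something comfortably smaller, for example 10,000,000,000 bytes, leaving headroom for index files and other per-broker data on the same volume.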

Security

  • The security features introduced in Apache Kafka 0.9 are not supported at this time.

Out-Of-Band Configuration

  • Out-of-band configuration modifications are not supported. The service’s core responsibility is to deploy and maintain the service with a specified configuration. In order to do this, the service assumes that it has ownership of task configuration. If an end-user makes modifications to individual tasks through out-of-band configuration operations, the service will override those modifications at a later time. For example:
    1. If a task crashes, it will be restarted with the configuration known to the scheduler, not one modified out-of-band.
    2. If a configuration update is initiated, all out-of-band modifications will be overwritten during the rolling update.

Scaling In

  • To prevent accidental data loss, the service does not support reducing the number of pods.

Disk Changes

  • To prevent accidental data loss from reallocation, the service does not support changing volume requirements after initial deployment.

Best-effort installation

  • If your cluster doesn’t have enough resources to deploy the service as requested, the initial deployment will not complete until either those resources are available or until you reinstall the service with corrected resource requirements.
  • Similarly, scale-outs following initial deployment will not complete if the cluster doesn’t have the needed available resources to complete the scale-out.

Virtual Networks

  • When the service is deployed on a virtual network, the service may not switch to host networking without a full re-installation. The same is true for attempting to switch from host to virtual networking.

Task Environment Variables

  • Each service task has some number of environment variables, which are used to configure the task. These environment variables are set by the service scheduler.
  • While it is possible to use these environment variables in ad-hoc scripts (e.g. via dcos task exec), the name of a given environment variable may change between versions of a service and should not be considered a public API of the service.

Usage and Deployment Instructions:

Step 1: Open PuTTY for SSH

Step 2: In PuTTY, enter <instance public IP> in the “Host Name” field

Step 3: Open the Connection -> SSH -> Auth tab from the left-side panel

Step 4: Click the Browse button, select the .ppk file for the instance, and then click Open

Step 5: Type "ubuntu" as the user name; authentication is handled automatically by the .ppk key

Step 5.1: If Ubuntu prompts you to install updates, run the following commands:

$ sudo apt-get update

$ sudo apt-get upgrade

Step 6: Use the following Linux commands to start Kafka and ZooKeeper

Step 6.1: $ sudo su 

To switch to the root user

Step 6.2: $ sudo vi /etc/hosts

Take the private IP address of your machine (visible in the EC2 console) and replace the second line of the hosts file with that private IP address, as illustrated below.
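
A purely illustrative example: assuming a hypothetical private IP of 172.31.5.10 and hostname ip-172-31-5-10, the edited /etc/hosts would look like:

127.0.0.1      localhost
172.31.5.10    ip-172-31-5-10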

Step 6.3: To stop Kafka if it is already running

>> sudo /opt/Kafka/kafka_2.11-0.10.0.1/bin/kafka-server-stop.sh

Step 6.4: To start Kafka

>> sudo  /opt/Kafka/kafka_2.11-0.10.0.1/bin/kafka-server-start.sh  /opt/Kafka/kafka_2.11-0.10.0.1/config/server.properties

 

Once Kafka has started, the server log output appears in the console.

Step 6.5: Testing Kafka Server

Open a new terminal:

1. Right-click in the current terminal

2. Select Duplicate Session

Enter 'ubuntu' as the user name

>> sudo /opt/Kafka/kafka_2.11-0.10.0.1/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1  --partitions 1 --topic <Your Test Name>

(where <Your Test Name> should be a unique topic name)

Example: >> sudo /opt/Kafka/kafka_2.11-0.10.0.1/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1  --partitions 1 --topic miritech

Step 6.6: Now, ask ZooKeeper to list the available topics on Apache Kafka by running the following command:

>>  sudo /opt/Kafka/kafka_2.11-0.10.0.1/bin/kafka-topics.sh --list --zookeeper localhost:2181
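
If topic creation succeeded, the output of this command should include the topic you created, for example: miritech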

Step 6.7: Now, publish sample messages to the Apache Kafka topic called miritech by using the following producer command:

>>  sudo /opt/Kafka/kafka_2.11-0.10.0.1/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic miritech 

Open a new terminal again:

3. Right-click in the current terminal

4. Select Duplicate Session

Enter 'ubuntu' as the user name

>> sudo su

Step 6.8: Now, use the consumer command to check for messages on the Apache Kafka topic called miritech by running the following command:

>> sudo /opt/Kafka/kafka_2.11-0.10.0.1/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic miritech --from-beginning
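
As a quick end-to-end check (illustrative text, not literal output): type a few lines into the producer terminal, for example:

Hello Kafka
This is a test message

Because the consumer was started with --from-beginning, the same lines should then appear in the consumer terminal:

Hello Kafka
This is a test message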

Step 7: If you need to work with Hadoop, then please follow the steps below:

Step 7.1:  $ ssh-keygen -t rsa -P ""

This command generates an SSH key pair with an empty passphrase.

Step 7.2:  $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

This command appends the generated public key to authorized_keys so that SSH to localhost works without a password.

Step 7.3: ssh localhost

Step 7.4: hdfs namenode -format

You have to write “yes” when it prompts you – Are you sure you want to continue?

Step 7.5: start-all.sh

Step 7.6: After the above command executes successfully, check the following URLs in a browser:

http://<instance-public-ip>:8088 (YARN ResourceManager web UI)

http://<instance-public-ip>:50070 (HDFS NameNode web UI)

http://<instance-public-ip>:50090 (HDFS Secondary NameNode web UI)

Step 8: Start HBase

$ cd /usr/local/hbase/bin

$ ./start-hbase.sh

 
