

Miri Infotech is set to launch StreamSets, software that delivers performance management for the dataflows that feed the next generation of big data applications. With its data operations platform, you can efficiently develop batch and streaming dataflows and operate them with full visibility and control.


Miri Infotech is launching a product that configures and publishes StreamSets as a pre-configured, ready-to-launch AMI on Amazon EC2, built on Ubuntu 16.04. The image bundles StreamSets Data Collector together with Hadoop, HBase, a NoSQL database, a messaging system, and a search system.

One of the main features of StreamSets is that developers can build batch and streaming dataflows with a minimum of code, while operators use a cloud-native product to aggregate dozens of dataflows into topologies and manage them centrally. Its mission is to bring operational excellence to the management of data in motion, so that data arrives on time and with quality, accelerating analysis and decision making.

StreamSets Connectors

We have various connectors, which are as follows:

  • Hadoop systems
  • NoSQL databases
  • Search systems
  • Messaging systems


Among the products associated with StreamSets, one of the most useful and impressive is the StreamSets Data Collector.

Let’s now understand what a Data Collector actually is and its related functionality:

StreamSets Data Collector is a lightweight, powerful engine that streams data in real time. Use Data Collector to route and process data in your data streams.

To define the flow of data for Data Collector, you configure a pipeline. A pipeline consists of stages that represent the origin and destination of the pipeline, and any additional processing that you want to perform. After you configure the pipeline, you click Start and Data Collector goes to work.

Data Collector processes data when it arrives at the origin and waits quietly when not needed. You can view real-time statistics about your data, inspect data as it passes through the pipeline, or take a close look at a snapshot of data.
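The pipeline model described above (an origin, optional processors, and a destination) can be pictured as a Unix pipe. This is only a rough analogy, not actual Data Collector configuration; the file paths and data below are made up for illustration:

```shell
# A rough Unix-pipe analogy for a Data Collector pipeline (illustrative only):
#   origin      -> read records from a source file
#   processor   -> filter/transform records in flight
#   destination -> write the results to a target file
printf 'INFO start\nERROR disk full\nINFO done\n' > /tmp/source.log  # fake origin data
grep 'ERROR' /tmp/source.log > /tmp/errors.log                       # processor + destination
cat /tmp/errors.log
```

In Data Collector the same roles are filled by configurable stages, and the engine adds the monitoring, statistics, and snapshots described above around that flow.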

Its most important feature is its smart, self-healing dataflow nature: it simplifies development cycles and lets you build dataflow pipelines in minutes rather than hours or days.

Common Use Cases are:

  • Apache Kafka enablement
  • Hadoop ingest
  • Cloud migration
  • Search enablement

We have some more use cases, discussed briefly below, that show how StreamSets is used and what functionality it provides in each case.

These are as follows:

EDW Replatform- By using StreamSets, you can take advantage of Hadoop as a low-cost, high-performance platform that can ingest all of your data, from transactional to unstructured, batch and streaming.

Internet of Things- Processing data from a multitude of diverse Internet sensors isn’t for the faint of heart, but for StreamSets it is an easy task: it can handle the volume and the ever-shifting data formats that come with this new world.

Cybersecurity- StreamSets can be used as part of your Apache Spot cybersecurity deployment, bringing big data to cybersecurity.


Launching StreamSets on AWS:

You can subscribe to an AWS Marketplace product and launch an instance from the product's AMI using the Amazon EC2 launch wizard.

To launch an instance from the AWS Marketplace using the launch wizard:

  • Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
  • From the Amazon EC2 dashboard, choose Launch Instance.
  • On the Choose an Amazon Machine Image (AMI) page, choose the AWS Marketplace category on the left. Find a suitable AMI by browsing the categories, or using the search functionality. Choose Select to choose your product.
  • A dialog displays an overview of the product you've selected. You can view the pricing information, as well as any other information that the vendor has provided. When you're ready, choose Continue.
  • On the Choose an Instance Type page, select the hardware configuration and size of the instance to launch. When you're done, choose Next: Configure Instance Details.
  • On the next pages of the wizard, you can configure your instance, add storage, and add tags. For more information about the different options you can configure, see Launching an Instance. Choose Next until you reach the Configure Security Group page.
  • The wizard creates a new security group according to the vendor's specifications for the product. The security group may include rules that allow all IP addresses (0.0.0.0/0) access on SSH (port 22) on Linux or RDP (port 3389) on Windows. We recommend that you adjust these rules to allow only a specific address or range of addresses to access your instance over those ports.
  • When you are ready, choose Review and Launch.
  • On the Review Instance Launch page, check the details of the AMI from which you're about to launch the instance, as well as the other configuration details you set up in the wizard. When you're ready, choose Launch to select or create a key pair, and launch your instance.
  • Depending on the product you've subscribed to, the instance may take a few minutes or more to launch. You are subscribed to the product first, before your instance can launch; if there are any problems with your credit card details, you will be asked to update your account details. Once the launch confirmation page displays, your instance is launching.


  • StreamSets Data Collector is an enterprise grade, open source, continuous big data ingestion infrastructure.
  • A good part about StreamSets is its easy, flexible user interface, which lets data scientists, developers, and data infrastructure teams create data pipelines in a fraction of the time.
  • It also reads from and writes to a large number of endpoints, including S3, JDBC, Hadoop, Kafka, Cassandra, and many others.


  • The StreamSets Data Collector can be used like a pipe for a data stream.
  • Throughout the enterprise data topology, streams of data must be moved, collected, and processed on the way to their destinations. Data Collector provides the crucial connection between hops in the stream.
  • To solve your ingest needs, you can use a single Data Collector to run one or more pipelines, or you can install a series of Data Collectors to stream data across your enterprise data topology.

Usage and Deployment Instructions

Step 1: Open PuTTY for SSH

Step 2: In PuTTY, type <instance public IP> in “Host Name” and "ubuntu" as the user name. The password is taken automatically from the PPK file.

Step 3: Use the following Linux commands to start StreamSets

Step 3.1: $ sudo vi /etc/hosts

Take the private IP address of your machine and replace the second line of the file with that private IP address.
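One way to script this edit is sketched below. It works on a copy first so you can review the result before overwriting /etc/hosts as root; it assumes `hostname -I` reports the instance's private IP first, as it typically does on Ubuntu:

```shell
# Grab the instance's private IP (first address reported by hostname -I)
PRIVATE_IP=$(hostname -I | awk '{print $1}')

# Work on a copy so the live /etc/hosts is untouched until you have reviewed it
cp /etc/hosts /tmp/hosts.new
sed -i "2s/.*/${PRIVATE_IP} $(hostname)/" /tmp/hosts.new

cat /tmp/hosts.new                    # review the result
# sudo cp /tmp/hosts.new /etc/hosts   # apply once it looks right
```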

Step 3.2: Switch from the ssh ‘ubuntu’ user to ‘root’

>> sudo su

>> cd /home/ubuntu

Step 3.3: Change open file limit.

>> ulimit -n 40000
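The ulimit change applies only to the current shell session, and raising the limit above the hard limit requires root, which is why the previous step switches to root. You can confirm it took effect by querying the soft limit afterwards:

```shell
# Raise the soft open-file limit for this shell session
ulimit -n 40000

# Query the soft limit to confirm; prints 40000 if the hard limit allows it
ulimit -n
```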

Step 3.4: Start the StreamSets Data Collector server. The launcher is the streamsets binary in the Data Collector installation's bin directory:

>> streamsets dc

Step 3.5: Now open StreamSets in the browser.

Open the URL:  http://<instance ip address>:18630/

where <instance ip address> is the public IP address of the running EC2 instance.

Username: admin 

Password: admin

Step 3.6: After login you will see the Streamsets Dashboard.

