Apache Hive in the vein!
More and more, we have to deal with large Volumes of data that are created and must be consumed at an unbelievable Velocity, with a huge Variety that is almost impossible for a human being to follow, whose Veracity we must be concerned with, and which must add Value to the business effectively (the 5 V's of Big Data).
To deal with this, the term “Big Data” came up, along with several solutions to these problems in different scenarios, such as Apache Hive.
According to IBM, “Apache Hive is open source data warehouse software for reading, writing, and managing large data set files that are stored directly on the Apache Hadoop Distributed File System (HDFS) or on other data storage systems, such as Apache HBase. Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements for querying and analyzing data. It was designed to make MapReduce programming easier, because you don’t need to know and write extensive Java code. Instead, you can write queries more simply in HQL, and Hive can create the map and reduce functions.”
As with any database management system (DBMS) today, it can be accessed via commands on a command-line interface, via a JDBC or ODBC connection, or via a custom driver/connector.
In addition to being based on the SQL we are already used to seeing in other databases, it also integrates Hadoop functions into its query language (HiveQL); with that, we have the possibility of using MapReduce, for example.
Looking for performance with HiveQL, we can use file formats such as RCFile, Avro, ORC, or Apache Parquet; enable vectorization; serialize or deserialize the data; identify the workload in queries; use skew joins, concurrent connections, or cursors; and use the Tez execution engine. These are just a few alternatives, and there is much more…
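As an illustration, a few of these knobs can be combined in plain HiveQL. This is a sketch, assuming a Hive version (2.x or later) where these properties exist; the `sales_orc` table and its columns are made up for the example:

```sql
-- Store the table in ORC, a columnar format that enables vectorized reads
CREATE TABLE sales_orc (
  id        BIGINT,
  amount    DOUBLE,
  sale_date STRING
)
STORED AS ORC;

-- Enable vectorized query execution (rows are processed in batches)
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;

-- Handle skewed join keys by processing them in a separate job
SET hive.optimize.skewjoin = true;

-- Run on the Tez execution engine instead of classic MapReduce
SET hive.execution.engine = tez;

SELECT sale_date, SUM(amount) FROM sales_orc GROUP BY sale_date;
```

Note that vectorization only kicks in for formats that support it, such as ORC, which is one reason the file format choice and the session settings go hand in hand.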
Because it is a Hadoop-based solution, it is widely used integrated with other solutions in this ecosystem, such as Apache Spark, often being part of a means to implement Extract, Transform, and Load (ETL) processes.
Most of the time, users want to take the data being processed in Spark and record it in Hive, or vice versa; for that, we can configure Spark or create a new session to establish this connection.
How to connect to remote hive server from spark
Spark connects directly to the Hive metastore, not through HiveServer2. To configure this, put hive-site.xml on your…
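As a sketch of that configuration, the key property Spark needs in a hive-site.xml (placed where Spark can read it, typically its conf/ directory) is the metastore URI; the host name below is a placeholder, and 9083 is the metastore's usual default port:

```xml
<configuration>
  <!-- Thrift URI of the remote Hive metastore service
       (host is an example; 9083 is the common default port) -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
```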
Many people do not like to use Spark SQL because of its performance when manipulating data. Because it is generally used in conjunction with DataFrames, to try to optimize this we can make use of PyArrow.
Optimizing Conversion between Spark and Pandas DataFrames using Apache PyArrow
If you are a Spark user who prefers to work in Python and Pandas, here we’re going to explore how we can benefit…
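A minimal sketch of what that optimization looks like in practice: the Arrow-related settings below come from the Spark 3.x configuration reference, and the SparkSession usage is guarded with a try/except since it assumes pyspark (and pyarrow) are installed.

```python
# Spark SQL settings that switch Spark <-> pandas conversion to Apache Arrow.
ARROW_CONFS = {
    # Use Arrow for toPandas() and createDataFrame(pandas_df)
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    # Fall back to the slow row-by-row path if a type is not Arrow-compatible
    "spark.sql.execution.arrow.pyspark.fallback.enabled": "true",
}

def arrow_session_builder(builder):
    """Apply the Arrow settings to a SparkSession builder (or any object
    exposing a .config(key, value) method that returns a builder)."""
    for key, value in ARROW_CONFS.items():
        builder = builder.config(key, value)
    return builder

if __name__ == "__main__":
    try:
        from pyspark.sql import SparkSession  # requires pyspark + pyarrow
        spark = arrow_session_builder(
            SparkSession.builder.appName("arrow-demo")
        ).getOrCreate()
        pdf = spark.range(1000).toPandas()  # Arrow-backed conversion
        print(len(pdf))
    except ImportError:
        print("pyspark not installed; the settings above are still illustrative")
```

On Spark 2.x the property name is `spark.sql.execution.arrow.enabled` instead, which is another reminder to check version compatibility.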
Spark uses the Java Virtual Machine (JVM), and with that we have the villainous garbage collector (the memory allocation and deallocation manager) and its unwanted behavior when subjected to multiple processors and large amounts of data. Newer versions of Java offer different garbage collection implementations that can be used during Spark execution for better performance.
Tuning Java Garbage Collection for Apache Spark Applications
This is a guest post from our friends in the SSG STO Big Data Technology group at Intel. Join us at the Spark Summit to…
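As a sketch of how such tuning is wired in, the JVM flags are passed through Spark's extra Java options; switching to the G1 collector is a common starting point for large heaps. The values here are illustrative, not universal recommendations, and the available flags vary by Java version:

```
# conf/spark-defaults.conf (sketch; flag values are illustrative starting points)
spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc
spark.driver.extraJavaOptions    -XX:+UseG1GC
```

The same properties can be passed per job with `spark-submit --conf` instead of editing spark-defaults.conf.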
NOTE: Before using, check the software versions and the compatibility between them!
To use Hive with Python, in addition to the possibility of making a JDBC connection, we can use the PyHive library; with it, in addition to further simplifying the use of Hive, we have the option of applying cursors to work with large volumes of data or using the Presto (PrestoDB) interface.
PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Presto and Hive.
NOTE: Make sure your operating system has the necessary libraries to connect to Hive!
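A sketch of the cursor approach, assuming a reachable HiveServer2 and `pip install 'pyhive[hive]'`; the host, table name, and batch size are placeholders. The batching helper itself is plain DB-API code, so it works with any conforming cursor:

```python
def fetch_in_batches(cursor, sql, batch_size=10_000):
    """Run a query and yield rows in fixed-size batches via the DB-API
    cursor, so a large result set never sits fully in memory."""
    cursor.execute(sql)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield from rows

def hive_example():
    # Assumes a running HiveServer2; connection parameters are placeholders.
    from pyhive import hive  # imported lazily so the helper above needs no pyhive
    conn = hive.Connection(host="hive-host", port=10000, username="user")
    try:
        for row in fetch_in_batches(conn.cursor(), "SELECT * FROM some_table"):
            print(row)
    finally:
        conn.close()
```

The generator keeps memory bounded by `batch_size` rows at a time, which is the main reason to prefer `fetchmany` over `fetchall` when the tables are large.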
# HIVE ON DOCKER
With the arrival of DataOps initiatives in the data area, many solutions will start to gain use in the container environment. This type of implementation abstracts much of the need for configuration management and can be used to build infrastructures quickly, easily, and effectively.
Imagine the difficulty of configuring 1,000 nodes with Hive; with Terraform, Docker, and Ansible you can pull it off, can’t you? …
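A sketch of what a containerized Hive can look like with docker-compose; the image and tag below are from the community big-data-europe project and may be outdated, so treat them as placeholders:

```yaml
# docker-compose.yml (sketch; community image, tag may be outdated)
version: "3"
services:
  hive-metastore:
    image: bde2020/hive:2.3.2-postgresql-metastore
    command: /opt/hive/bin/hive --service metastore
    ports:
      - "9083:9083"      # metastore Thrift service
  hive-server:
    image: bde2020/hive:2.3.2-postgresql-metastore
    ports:
      - "10000:10000"    # HiveServer2 (JDBC/ODBC clients)
```

A real deployment also needs the backing metastore database and Hadoop services, which the example projects below wire together.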
Some examples of Hive implementations with Docker:
This is a docker container for Apache Hive 2.3.2. It is based on https://github.com/big-data-europe/docker-hadoop so…
Setup Apache Hive in Docker
Since the time Hadoop came up, its ecosystem has been getting larger and larger. There is so much software being…
Apache Hive TM
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in…
Apache Software Foundation
Table of Contents You can install a stable release of Hive by downloading a tarball, or you can download the source…
What is Apache Hive?
Hive is open source data warehouse software, similar to SQL, for reading, writing, and managing large data sets that…
7 Best Hive Optimization Techniques - Hive Performance - DataFlair
There are several types of Hive query optimization techniques available while running our Hive queries to improve…
Hive Performance Tuning - Hadoop Online Tutorials
In our previous post, we discussed Hadoop job optimization, or Hadoop performance tuning, for MapReduce jobs…
Thanks for reading! :)