Retail Store Clustering with just one click

   Retail store clustering is a data analysis technique that groups similar stores based on various characteristics. It allows retailers to identify regional market trends and preferences, enabling them to tailor their product offerings and marketing strategies to local demand.

  • Better Understanding of Market Demands
  • Improved Store Planning and Operations
  • Better Resource Allocation
  • Improved Customer Experience
  • Competitive Advantage


With our clustering solution, you can customize the inputs, assign a weight to each one, analyze the results, and iterate as many times as needed to reach a satisfactory result.


In this post, we'll walk through our clustering solution and show how you can use it with your own data.

Click here to download the project from the official Knime page

Click here to download the sample data


   First, run the first step and fill in the input sections with the data file location, the cluster count, and the weight for each feature used in the model.


  In the second step, the data is loaded into Knime and the different data sheets are joined. It is important that your dataset uses the same sheet names as the sample data, since the sheet names are passed as parameters.

  Then, in the third step, you might want to exclude certain stores from the model because they are outliers in terms of their features.


   
  In the fourth step, you might want to cluster the stores within their respective capacity group. For example, filtering only the big stores will cluster just those stores and exclude the others from the model. This can be useful if your allocation plans are highly dependent on store size.


  

   In the fifth step, the user weights are applied to the features, several transformation steps are performed, and the final table is sent to models 1 and 2.

   To assign a weight to each product category column when the number of columns is not known in advance, we use the Column List Loop node.

  This node lets us rename each category column to an anonymous column and apply the user weight to it. No matter how many product category columns you have, you will be able to assign a weight to each of them.

Note: Some nodes might need adjustments if your dataset has more features than the sample dataset. Please leave a comment or reach out to us if that's the case.


After the data is passed to each model, both the K-means and the hierarchical models are ready to use.

In the sixth step, the models run and some groupings are created for use in the results dashboards.



The results dashboards contain:
  • Cluster Store Count Distribution
  • Store Count by Cluster and Store Capacity Group
  • Cluster - Store Details
  • Cluster Averages

You can also see the cluster store distribution on a map with the help of the OSM Map View node.

  You can customize the map tooltip inside model 1 or model 2 by using the Column Filter node at the top of the metanode.


You can also compare the performance of the two models by checking their silhouette scores.
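For readers who want to see the underlying idea outside Knime, here is a minimal Python sketch of the same comparison using scikit-learn. The file name, feature columns, and weights below are hypothetical placeholders, not part of the workflow itself:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Hypothetical store feature table; file name, columns, and weights are placeholders
stores = pd.read_excel("stores.xlsx")
features = ["sales", "basket_size", "footfall"]
weights = [3, 1, 2]  # user-defined weight per feature

# Scale the features, then apply the user weights
X = StandardScaler().fit_transform(stores[features]) * weights

k = 5
kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)

print("K-means silhouette:     ", silhouette_score(X, kmeans_labels))
print("Hierarchical silhouette:", silhouette_score(X, hier_labels))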




You can also export the results to Excel with the Excel Writer node for further analysis.


In conclusion, retail store clustering is an essential tool for retailers to gain a competitive advantage and improve their overall performance. With our solution, you can easily cluster your stores with just a few clicks.

If you liked this project, please leave a comment below and share it on your socials.



How to Create a Date Table Using Knime

 


In this post, we will show how to create a date table with a user-specified start and end date, using Knime.







   A date table is an essential component of any data analytics project, as it allows for the accurate and efficient querying of data over time. This table is typically used in conjunction with a fact table, which contains the numerical data that is being analyzed. The date table contains the dates and other time-based information, such as month, quarter, and year, that are used to filter and aggregate the data in the fact table.







   One of the main benefits of using a date table is that it allows for easy filtering of data by specific date ranges. For example, if you want to see sales data for a specific month, you can easily filter the data in the fact table by the corresponding month in the date table. This is much more efficient than trying to filter the data in the fact table by individual dates, as it would require a much more complex query.




   A date table also supports forecasting and trend analysis. It makes it easy to calculate trends, rolling averages, and other important metrics. By keeping dates in a separate table, it is much simpler to query data for a specific time frame and to do time-based calculations such as year-over-year or month-over-month growth.




Column Expression node containing date metadata calculations




With our Knime Date Table Generator, you can create your own custom date tables with all the main metadata included. In addition, you can add your own metadata fields using the Column Expression node to enrich your date table.
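If you want to prototype the same idea outside Knime, here is a minimal Python sketch that builds a date table with pandas; the start and end dates below are placeholders for the user inputs the workflow asks for:

import pandas as pd

# Placeholder start and end dates; the workflow takes these as user inputs
start, end = "2020-01-01", "2025-12-31"

dates = pd.DataFrame({"date": pd.date_range(start, end, freq="D")})
dates["year"] = dates["date"].dt.year
dates["quarter"] = dates["date"].dt.quarter
dates["month"] = dates["date"].dt.month
dates["month_name"] = dates["date"].dt.month_name()
dates["week"] = dates["date"].dt.isocalendar().week
dates["day_name"] = dates["date"].dt.day_name()

print(dates.head())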



   In conclusion, using a date table in data analytics is a best practice that can greatly improve the efficiency, accuracy, and flexibility of your data analysis. It allows for easy filtering and aggregation of data, and enables powerful trend analysis and forecasting. It also ensures data quality by providing a consistent format for date and time information. 











How We Automated Our Twitter Promotion with Knime

 In this blog, we'll share how we automated our Twitter promotions with our Knime Workflow.



Click here to download the workflow from the official Knime page


   This workflow connects to the RSS feed of our blog to get information such as the title, published date, and URL.

   Then it connects to Twitter using a Twitter developer account and the Knime Twitter API. It uses the blog information and posts a promotional tweet for us with the custom hashtags we defined.


Here is the breakdown of the workflow...


   In the Table Creator node, we paste the URL of the RSS feed and read the blog information with the RSS Feed Reader node (you might need to install this extension).


   To promote only the latest posts, we filter the published date column by today's date (we created a "today" variable inside the metanode).


   Then we remove every column except Title and Blog URL. We use these two columns to create variables. With the help of the Group Loop node, we can pass each Title and Blog URL to the Twitter Post Tweet node and let Knime do its magic and post the tweet for us.
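For reference, the same automation can be sketched in plain Python. The snippet below is only a rough outline, assuming the feedparser and tweepy libraries, with placeholder credentials and feed URL; it is not part of the Knime workflow:

import datetime
import feedparser
import tweepy

# Placeholder credentials and feed URL -- replace with your own
FEED_URL = "https://example.com/feed"
client = tweepy.Client(
    consumer_key="...", consumer_secret="...",
    access_token="...", access_token_secret="...")

today = datetime.date.today()

for entry in feedparser.parse(FEED_URL).entries:
    published = datetime.date(*entry.published_parsed[:3])
    if published == today:  # only promote posts published today
        text = f"New on the blog: {entry.title} {entry.link} #knime #automation"
        client.create_tweet(text=text)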


    













We can customize the text and add more hashtags or links using this node.




Thanks to Knime, we don't have to manually promote our blogs. 


If you liked this content, please share and leave a comment below. 



How to Create Delta Tables from PySpark DataFrames

 

Delta tables, provided by the Delta Lake storage layer for Apache Spark, allow you to store large datasets as an efficient and scalable set of files on disk. Delta tables support fast, ACID transactions and provide efficient data management and access for data lakes and data pipelines.


To create a Delta table from a PySpark DataFrame, you can use the write.format() method with the "delta" format. For example:


df.write.format("delta").save("/path/to/delta/table")

This will create a Delta table at the specified path, and write the data from the DataFrame to the table.



You can also specify additional options when creating a Delta table. For example, the mode option controls whether to append the data to the table, overwrite the table, or throw an error if the table already exists:

df.write.format("delta").mode("overwrite").save("/path/to/delta/table")




You can also use the partitionBy option to specify a column or set of columns to use as partition columns in the Delta table. For example:

df.write.format("delta").partitionBy("col1", "col2").save("/path/to/delta/table")


For comparison, other write methods expose their own options; the CSV writer (DataFrameWriter.csv()), for example, has these default parameters:

DataFrameWriter.csv(
path,
mode=None,
compression=None,
sep=None,
quote=None,
header=None,
nullValue=None,
dateFormat=None,
encoding=None,
emptyValue=None,
lineSep=None
)



By using Delta tables, you can efficiently store and manage large datasets in Apache Spark, and take advantage of the scalability and performance of the Delta engine.



If you found this post useful, please don't forget to share and leave a comment below.









How to Use SQL Queries in PySpark

 


 PySpark is a popular open-source library for working with big data using Apache Spark. One of the key features of PySpark is the ability to write SQL queries to query data stored in a variety of data sources, including Apache Hive tables, Parquet files, and relational databases.

 


   To write SQL queries in PySpark, you can use the spark.sql() function, which executes SQL queries against tables and views registered in the Spark session. Here's an example of how to use spark.sql() alongside a DataFrame read from a Hive table:


from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("My App").getOrCreate()

# Read a DataFrame from a Hive table
df = spark.table("my_table")

# Execute a SQL query against the same Hive table
result = spark.sql("SELECT * FROM my_table WHERE col1 = 'value'")

# Print the results
result.show()


   In this example, we create a SparkSession and use it to read a DataFrame from a Hive table. We then use the spark.sql() function to execute a SQL query against that table and print the results using the show() method.

 

   You can also use the spark.sql() function to query data stored in other data sources, such as Parquet files or relational databases. To do this, read the data into a DataFrame with read.format(), register the DataFrame as a temporary view with createOrReplaceTempView(), and then query that view by name with spark.sql(). For example:


# Read a DataFrame from a Parquet file
df = spark.read.format("parquet").load("/path/to/file.parquet")

# Register the DataFrame as a temporary view so it can be queried by name
df.createOrReplaceTempView("df")

# Execute a SQL query on the view
result = spark.sql("SELECT * FROM df WHERE col1 = 'value'")

# Print the results
result.show()


   By using the spark.sql() function, you can easily write and execute SQL queries on data stored in PySpark DataFrames, and leverage the power of SQL to work with big data in Python.
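
Because spark.sql() runs ordinary SQL, you are not limited to simple filters. Here is a minimal sketch of an aggregation over the view registered above; "col1" is a placeholder column name:

# Aggregate over the registered view using standard SQL
summary = spark.sql("""
    SELECT col1, COUNT(*) AS row_count
    FROM df
    GROUP BY col1
    ORDER BY row_count DESC
""")
summary.show()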


   If you found this post useful, please don't forget to share and leave a comment below.




Handling Duplicates with PySpark




   In PySpark, you can use the dropDuplicates() function to remove duplicate rows from a DataFrame. This function lets you specify the subset of columns to consider when identifying duplicates.


However, to check whether there are any duplicates in your dataset, count the number of distinct rows in a subset and compare it with the total number of rows. If they are equal, there are no duplicates in the selected subset; otherwise, duplicates exist.


from pyspark.sql import functions as F

cols = ['employee_name', 'department']

# Count distinct rows over the subset and compare with the total row count
counts_df = df.select([
    F.countDistinct(*cols).alias('n_unique'),
    F.count('*').alias('n_rows')])
n_unique, n_rows = counts_df.collect()[0]

if n_rows == n_unique:
    print(f'Total rows ({n_rows}) and unique rows ({n_unique}) are equal. '
          'Therefore, there are no duplicate rows for the given subset.')
else:
    print(f'Total rows ({n_rows}) and unique rows ({n_unique}) are not equal. '
          f'Therefore, there are {n_rows - n_unique} duplicate rows for the given subset.')
 



   Here is an example of how to use dropDuplicates() to remove all duplicate rows from a DataFrame: 

df = df.dropDuplicates() # removes all duplicate rows


   You can also use the dropDuplicates() function to remove duplicates based on specific columns. For example:

df = df.dropDuplicates(subset=["col1", "col2"]) # removes duplicates based on the values in "col1" and "col2"


   Note that, unlike pandas, PySpark's dropDuplicates() does not accept a keep argument; it keeps a single occurrence of each duplicate row and drops the rest, and you cannot choose which occurrence survives through this function.


   It's important to choose the appropriate method for handling duplicate values based on the context and the needs of your analysis.


   If you found this post useful, please don't forget to share and leave a comment below.





Handling Missing Values with PySpark

 


   PySpark has several methods for handling missing values. You can use the dropna() function to remove rows with missing values from a DataFrame. This function lets you specify how many null values a row must contain before it is dropped, as well as the subset of columns to consider.


   To check whether there are missing values in the dataset, use the isNull() and isNotNull() functions.

# To check if there is a missing value -- ISNULL
display(df.filter(df['salary'].isNull()))

# To check non null values -- ISNOTNULL
display(df.filter(df['salary'].isNotNull()))


   To drop rows containing null values, use the na.drop() function; it is interchangeable with the dropna() function.

The how parameter takes two values:

  • "any" = drop a row if it contains ANY null values
  • "all" = drop a row only if ALL of its values are null

df.na.drop(how="any")
df.na.drop(how="all")



   To drop rows that have fewer than thresh non-null values, use the thresh parameter as shown below. When set, it overrides the how parameter.

df.na.drop(thresh=2)

To impute the missing value with a specified value, use the na.fill() method.

   The syntax for this method is df.na.fill(value, subset), where value is the replacement value and subset is the column (or list of columns) to fill.

# If no column given, value will be imputed on the entire dataframe
# na.fill() and fillna() are interchangeable.
df.na.fill('No address', 'address')
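
You can also fill several columns with different values in a single call by passing a dictionary; the column names below are just placeholders:

# Fill several columns at once; keys are column names, values are the replacements
df.na.fill({"address": "No address", "salary": 0})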


   To impute missing values with central tendency measures (mean, median, or mode), use the Imputer from pyspark.ml.feature as shown below.

from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=['salary'],
    outputCols=["{}_imputed".format(a) for a in ['salary']]
).setStrategy("mean")

# Fit the imputer and fill the nulls with the imputed mean values
imputer.fit(df).transform(df)


   It's important to choose the appropriate method for handling missing values based on the context and the needs of your analysis.


If you found this post useful, please don't forget to share and leave a comment below.



