Retail Store Clustering with just one click

   Retail store clustering is a data analysis technique that groups similar stores based on various characteristics. It allows retailers to identify regional market trends and preferences, enabling them to tailor their product offerings and marketing strategies to local demand.

  • Better Understanding of Market Demands
  • Improved Store Planning and Operations
  • Better Resource Allocation
  • Improved Customer Experience
  • Competitive Advantage


   With our clustering solution, you can customize the inputs, assign a weight to each one, analyze the results, and iterate as many times as needed to reach a satisfactory result.


   In this post, we'll go over our Clustering Solution and how you can use it with your own data.

Click here to download the project from the official Knime page

Click here to download the sample data


   In the first step, run the workflow and fill in the input sections with the data file location, the cluster count, and a weight for each feature used in the model.


  In the second step, the data is loaded into Knime and the different data sheets are joined. In your dataset, it is important to keep the same sheet names as the sample data, since the sheet names are passed as parameters.
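  For context, here is a rough pandas equivalent of this load-and-join step; the sheet names "Stores" and "Sales" and the "store_id" join column below are placeholders for illustration, not the workflow's actual parameters.

import pandas as pd

# Hypothetical sheet names -- the workflow passes sheet names as parameters,
# which is why they must match the sample data exactly.
stores = pd.read_excel("stores_data.xlsx", sheet_name="Stores")
sales = pd.read_excel("stores_data.xlsx", sheet_name="Sales")

# Join the sheets on a shared store identifier (placeholder column name)
data = stores.merge(sales, on="store_id", how="left")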

  Then in the third step, you might want to exclude certain stores from the model because they might be outliers in terms of their features.


   
  In the fourth step, you might want to cluster the stores within their respective capacity group. For example, filtering for only the big stores will cluster just those stores and exclude the others from the model. This can be useful if your allocation plans are highly dependent on store size.


  

   In the fifth step, the user weights are applied to the features, several transformation steps are performed, and the final table is sent to models 1 and 2.

   To assign a weight to an unknown number of product-category columns, we use the Column List Loop node.

  This node lets us rename each category column to an anonymous column and apply the user weight to it. With this approach, no matter how many product-category columns you have, you can assign a weight to each of them.
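  As a rough illustration of what this loop does, here is the same idea in Python with pandas; the column prefix and the weight value are assumptions for the sketch, not the workflow's actual parameters.

import pandas as pd

# Example frame with an arbitrary number of product-category columns
df = pd.DataFrame({
    "store_id": [1, 2, 3],
    "cat_apparel": [0.4, 0.1, 0.5],
    "cat_grocery": [0.6, 0.9, 0.5],
})

category_weight = 2.0  # user-supplied weight (placeholder value)

# Apply the same weight to every category column, however many there are
category_cols = [c for c in df.columns if c.startswith("cat_")]
df[category_cols] = df[category_cols] * category_weight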

Note: Some nodes may need adjustments if your dataset has more features than the sample dataset. Please leave a comment or reach out to us if that's the case.


After the data is passed to each model, both the K-means and the hierarchical clustering models are ready to use.

In the sixth step, the models run and several groupings are created for use in the results dashboards.



The results dashboards contain:
  • Cluster Store Count Distribution
  • Store Count by Cluster and Store Capacity Group
  • Cluster - Store Details
  • Cluster Averages

You can also see the cluster-store distribution on a map with the help of the OSM Map View node.

  You can customize the map tooltip inside model 1 or 2 by using the Column Filter node at the top of the metanode.


You can also compare the performance of the two models by checking their silhouette scores.
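If you want to reproduce that comparison outside Knime, a minimal sketch with scikit-learn looks like this; the feature matrix and the cluster count are placeholders standing in for the weighted store features.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = np.random.rand(100, 5)  # placeholder for the weighted store features
k = 4                       # placeholder cluster count

kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)

print("K-means silhouette:      ", silhouette_score(X, kmeans_labels))
print("Hierarchical silhouette: ", silhouette_score(X, hier_labels))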




You can also export the results to Excel with the Excel Writer node for further analysis.


In conclusion, retail store clustering is an essential tool for retailers to gain a competitive advantage and improve their overall performance. With our solution, you can easily cluster your stores with just a few clicks.

If you liked this project, please leave a comment below and share it on your socials.



How to create Date Table using Knime

 


In this post, we will show how to create a date table with a user-specified start and end date, using Knime.







   A date table is an essential component of any data analytics project, as it allows for the accurate and efficient querying of data over time. This table is typically used in conjunction with a fact table, which contains the numerical data that is being analyzed. The date table contains the dates and other time-based information, such as month, quarter, and year, that are used to filter and aggregate the data in the fact table.







   One of the main benefits of using a date table is that it allows for easy filtering of data by specific date ranges. For example, if you want to see sales data for a specific month, you can easily filter the data in the fact table by the corresponding month in the date table. This is much more efficient than trying to filter the data in the fact table by individual dates, as it would require a much more complex query.




   A date table also makes forecasting and trend analysis easier. It simplifies the calculation of trends, rolling averages, and other important metrics. With a separate table for dates, it is much simpler to query data within a specific time frame and to perform time-based calculations such as year-over-year or month-over-month growth.




Column Expression node containing date metadata calculations




With our Knime Date Table Generator, you can create your own custom date tables with all the main metadata included. In addition, you can add your own metadata fields using the Column Expression node to enrich your date table.
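For readers who prefer code, a minimal Python sketch of the same idea (a date range enriched with common metadata columns) might look like this; the start and end dates are placeholders.

import pandas as pd

start_date, end_date = "2023-01-01", "2024-12-31"  # user-specified range (placeholders)

dates = pd.DataFrame({"date": pd.date_range(start_date, end_date, freq="D")})
dates["year"] = dates["date"].dt.year
dates["quarter"] = dates["date"].dt.quarter
dates["month"] = dates["date"].dt.month
dates["month_name"] = dates["date"].dt.month_name()
dates["day_of_week"] = dates["date"].dt.day_name()
dates["is_weekend"] = dates["date"].dt.dayofweek >= 5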



   In conclusion, using a date table in data analytics is a best practice that can greatly improve the efficiency, accuracy, and flexibility of your data analysis. It allows for easy filtering and aggregation of data, and enables powerful trend analysis and forecasting. It also ensures data quality by providing a consistent format for date and time information. 











How We Automated Our Twitter Promotion with Knime

 In this blog, we'll share how we automated our Twitter promotions with our Knime Workflow.



Click here to download the workflow from the official Knime page


   This workflow connects to the RSS feed of our blog to get information such as the title, the date published, and the URL.

   Then it connects to Twitter using a Twitter developer account and the Knime Twitter API nodes, takes the blog information, and posts a promotion tweet for us with the custom hashtags we defined.


Here is the breakdown of the workflow...


   In the Table Creator node, we paste the URL of the RSS feed and read the blog information with the RSS Feed Reader node (you might need to install this extension).


   To promote only the latest blogs, we filter the published-date column against today's date (we created a "today" variable inside the metanode).


   Then we remove every column except Title and Blog URL. We use these two columns to create variables. With the help of the Group Loop node, we can pass each Title and Blog URL to the Twitter Post Tweet node and let Knime do its magic and post the tweet for us.
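   For reference, a rough Python equivalent of this workflow could look like the sketch below; it assumes the feedparser and tweepy libraries, placeholder credentials, and a placeholder feed URL rather than the exact setup used in Knime.

import feedparser
import tweepy

# Read the blog's RSS feed (placeholder URL)
feed = feedparser.parse("https://example.com/blog/rss")

# Authenticate with a Twitter developer account (placeholder credentials)
client = tweepy.Client(
    consumer_key="...", consumer_secret="...",
    access_token="...", access_token_secret="...",
)

# Post a promotion tweet for the latest entry with custom hashtags
latest = feed.entries[0]
client.create_tweet(text=f"New blog post: {latest.title} {latest.link} #knime #dataanalytics")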


    













We can customize the text and add more hashtags or links using this node.




Thanks to Knime, we don't have to manually promote our blogs. 


If you liked this content, please share and leave a comment below. 



How to create Delta Tables from Pyspark Dataframes

 

Delta tables are provided by Delta Lake, an open-source storage layer that runs on top of Apache Spark and lets you store large datasets as an efficient, scalable set of files on disk. Delta tables support fast, ACID transactions and provide efficient data management and access for data lakes and data pipelines.


To create a Delta table from a PySpark DataFrame, you can use the write.format() function with the "delta" option. For example:


df.write.format("delta").save("/path/to/delta/table")

This will create a Delta table at the specified path, and write the data from the DataFrame to the table.



You can also specify additional options when creating a Delta table. For example, you can use the mode option to specify whether to append the data to the table, overwrite the table, or throw an error if the table already exists. For example:

df.write.format("delta").mode("overwrite").save("/path/to/delta/table")




You can also use the partitionBy option to specify a column or set of columns to use as partition columns in the Delta table. For example:

df.write.format("delta").partitionBy("col1", "col2").save("/path/to/delta/table")


For reference, the related DataFrameWriter.csv() write method has these default parameters:

DataFrameWriter.csv(
path,
mode=None,
compression=None,
sep=None,
quote=None,
header=None,
nullValue=None,
dateFormat=None,
encoding=None,
emptyValue=None,
lineSep=None
)



By using Delta tables, you can efficiently store and manage large datasets in Apache Spark, and take advantage of the scalability and performance of the Delta engine.



If you found this post useful, please don't forget to share and leave a comment below.









How to Use SQL Queries in Pyspark

 


 PySpark is a popular open-source library for working with big data using Apache Spark. One of the key features of PySpark is the ability to write SQL queries to query data stored in a variety of data sources, including Apache Hive tables, Parquet files, and relational databases.

 


   To write SQL queries in PySpark, you can use the spark.sql() function, which allows you to execute SQL queries on data stored in a PySpark DataFrame. Here's an example of how to use the spark.sql() function to query a DataFrame:


from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("My App").getOrCreate()

# Read a DataFrame from a Hive table
df = spark.table("my_table")

# Execute a SQL query on the DataFrame
result = spark.sql("SELECT * FROM my_table WHERE col1 = 'value'")

# Print the results
result.show()


   In this example, we create a SparkSession and use it to read a DataFrame from a Hive table. We then use the spark.sql() function to execute a SQL query on the DataFrame, and print the results using the show() method.

 

   You can also use the spark.sql() function to execute queries on data stored in other data sources, such as Parquet files or relational databases. To do this, you can use the read.format() function to read the data into a DataFrame, register the DataFrame as a temporary view, and then query that view with spark.sql(). For example:


# Read a DataFrame from a Parquet file
df = spark.read.format("parquet").load("/path/to/file.parquet")

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("df")

# Execute a SQL query on the view
result = spark.sql("SELECT * FROM df WHERE col1 = 'value'")

# Print the results
result.show()


   By using the spark.sql() function, you can easily write and execute SQL queries on data stored in PySpark DataFrames, and leverage the power of SQL to work with big data in Python.


   If you found this post useful, please don't forget to share and leave a comment below.




Handling Duplicates with Pyspark




   In PySpark, you can use the dropDuplicates() function to remove duplicate rows from a DataFrame. This function lets you specify the columns to consider when identifying duplicates, and it keeps a single occurrence of each duplicate row (effectively the first one encountered).


However, to check whether there are any duplicates in your dataset, count the number of distinct rows in a subset and compare it with the total number of rows. If they're the same, there are no duplicates in the selected subset; otherwise, duplicates exist.


from pyspark.sql import functions as F

cols = ['employee_name', 'department']

counts_df = df.select([
    F.countDistinct(*cols).alias('n_unique'),
    F.count('*').alias('n_rows')])
n_unique, n_rows = counts_df.collect()[0]

if n_rows == n_unique:
    print(f'Total rows ({n_rows}) and unique rows ({n_unique}) are equal. '
          'Therefore, there are no duplicate rows for the given subset.')
else:
    print(f'Total rows ({n_rows}) and unique rows ({n_unique}) are not equal. '
          f'Therefore, there are {n_rows - n_unique} duplicate rows for the given subset.')
 



   Here is an example of how to use dropDuplicates() to remove all duplicate rows from a DataFrame: 

df = df.dropDuplicates() # removes all duplicate rows


   You can also use the dropDuplicates() function to remove duplicates based on specific columns. For example:

df = df.dropDuplicates(subset=["col1", "col2"]) # removes duplicate rows based on the values in columns "col1" and "col2"


   Note that PySpark's dropDuplicates() always keeps a single occurrence of each duplicate row and, unlike pandas' drop_duplicates(), does not accept a keep argument. If you need to keep the last occurrence instead, one option is to number the rows within each duplicate group with a window function and keep only the last one, as in the sketch below.
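A minimal sketch of that window-function approach, assuming a hypothetical "timestamp" column that defines which row counts as the last occurrence:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows within each duplicate group, newest first (placeholder ordering column)
w = Window.partitionBy("col1", "col2").orderBy(F.col("timestamp").desc())

df_last = (
    df.withColumn("row_num", F.row_number().over(w))
      .filter(F.col("row_num") == 1)
      .drop("row_num")
)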


   It's important to choose the appropriate method for handling duplicate values based on the context and the needs of your analysis.


   If you found this post useful, please don't forget to share and leave a comment below.





Handling Missing Values with Pyspark

 


   PySpark has several methods for handling missing values. You can use the dropna() function to remove rows with missing values from a DataFrame. This function lets you control how aggressively rows are dropped (the how and thresh parameters) and which columns are considered (the subset parameter).


   To check whether there are missing values in the dataset, use the isNull() and isNotNull() functions.

# To check if there is a missing value -- ISNULL
display(df.filter(df['salary'].isNull()))

# To check non null values -- ISNOTNULL
display(df.filter(df['salary'].isNotNull()))


   To drop rows with null values, use the na.drop() function, which is interchangeable with the dropna() function.

The how parameter takes two values:

"any" = drop a row if it contains ANY nulls

"all" = drop a row only if EVERY value in the row is null

df.na.drop(how="any")
df.na.drop(how="all")



   To drop rows that have fewer than thresh non-null values, use the call below. This overrides the how parameter.

df.na.drop(thresh=2)

To impute the missing value with a specified value, use the na.fill() method.

   The syntax for this method is: df.na.fill(value, subset), where value is the imputed value and subset is the column (or list of columns) to fill.

# If no column given, value will be imputed on the entire dataframe
# na.fill() and fillna() are interchangeable.
df.na.fill('No address', 'address')


   To impute NA values with a central-tendency measure such as the mean or median, use the Imputer from pyspark.ml.feature, as below.

from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=['salary'],
    outputCols=["{}_imputed".format(a) for a in ['salary']]
).setStrategy("mean")

# Fit the imputer and transform the DataFrame, filling nulls with the mean value
imputer.fit(df).transform(df)


   It's important to choose the appropriate method for handling missing values based on the context and the needs of your analysis.


If you found this post useful, please don't forget to share and leave a comment below.




How to read flat files in Python Pandas

 

You can use the read_csv() function from the Pandas library to read a flat file in Python. This function can handle a wide variety of file formats, including CSV, TSV, and other types of delimiter-separated files.


Here is an example of how to use read_csv() to read a CSV file:

import pandas as pd

df = pd.read_csv("data.csv")


Here is an example of how to use read_csv() to read a TSV file:

import pandas as pd

df = pd.read_csv("data.tsv", sep="\t")


   You can also use the read_csv() function to specify additional options, such as the encoding of the file, the index column, or whether to skip rows. For example:


import pandas as pd

df = pd.read_csv("data.csv", encoding="utf-8", index_col=0, skiprows=[1, 2])

   

   For more information on the read_csv() function and its various options, you can refer to the Pandas documentation.


If you found this post useful, please don't forget to share and leave a comment below.




SQL Server Common Table Expressions (CTEs)

 



   A common table expression (CTE) is a temporary result set that you can reference within a SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. CTEs are a powerful tool in SQL Server, as they allow you to define a reusable set of rows within a query, and can be used to simplify complex queries by breaking them up into smaller, more manageable pieces.


Here's an example of how to use a CTE in a SELECT statement:


WITH cte AS (
  SELECT col1, col2 FROM my_table WHERE col3 = 'value'
)
SELECT * FROM cte WHERE col1 > 10


    In this example, the CTE is defined as a SELECT statement that retrieves rows from the my_table table where col3 equals 'value'. The CTE is then referenced in the outer SELECT statement, which retrieves all rows from the CTE where col1 is greater than 10.


    CTEs can also be used in other types of statements, such as INSERT, UPDATE, DELETE, and CREATE VIEW. For example:


WITH cte AS (
  SELECT col1, col2 FROM my_table WHERE col3 = 'value'
)
INSERT INTO my_other_table (col1, col2)
SELECT col1, col2 FROM cte WHERE col1 > 10


   In this example, the CTE is used in an INSERT statement to insert rows into the my_other_table table.


   CTEs are a useful tool for simplifying and organizing complex queries in SQL Server. By breaking a query up into smaller, more manageable pieces, you can make your code easier to read and maintain.


   If you found this post useful, please don't forget to share and leave a comment below.




SQL Server Window Functions

 




   Window functions are a powerful feature in SQL Server that allow you to perform calculations over a set of rows, or "window", in a SELECT statement. They are similar to aggregate functions, but unlike aggregate functions, which return a single value for a group of rows, window functions return a value for each row in the result set.


   To use a window function in SQL Server, you specify the function in the SELECT clause, followed by the OVER() clause, which defines the window over which the function will operate. Here's an example of how to use the RANK() function to rank the rows in a result set:


SELECT col1, col2, RANK() OVER (ORDER BY col1) AS rank
FROM my_table


    In this example, the RANK() function is used to rank the rows in the my_table table based on the values in the col1 column. The OVER() clause specifies that the function should operate over all rows in the table, and the ORDER BY clause specifies that the rows should be ranked in ascending order based on the values in col1.


   There are several other window functions available in SQL Server, including ROW_NUMBER(), DENSE_RANK(), and NTILE(). These functions can be used to perform a variety of calculations, such as ranking rows, numbering rows, and dividing rows into groups.


   Window functions are a powerful tool for performing calculations over a set of rows in SQL Server. By using window functions, you can easily perform complex calculations on your data without having to write lengthy, complex queries.


   If you found this post useful, please don't forget to share and leave a comment below.



Coalesce and IsNull SQL Commands Comparison



   In SQL, the "COALESCE" and "ISNULL" functions are used to handle null values. Both functions allow you to specify a default value to use in place of nulls, but they work in slightly different ways.


The "COALESCE" function returns the first non-null value in a list of arguments. For example:

SELECT COALESCE(NULL, 'default value', 'another value')

This would return 'default value', because it is the first non-null value in the list.


   On the other hand, SQL Server's "ISNULL" function takes exactly two arguments and returns the second argument when the first one is null. For example:

SELECT ISNULL(NULL, 'default value')

This would return 'default value', because the first argument is null. (Note that MySQL's single-argument ISNULL() behaves differently and returns 1 or 0 to indicate whether the value is null.)


   So, the main difference between "COALESCE" and "ISNULL" is that "COALESCE" accepts any number of arguments and returns the first non-null one, while "ISNULL" accepts exactly two arguments and replaces a null with the given value. The two can also differ in how the data type of the result is determined.


   Both "COALESCE" and "ISNULL" can be useful in different situations. For example, you might use "COALESCE" to fall back through several candidate columns, while you might use "ISNULL" to replace a single nullable column with a default value in a query.


   It's important to choose the appropriate function based on your specific needs and the context of your query.


If you found this post useful, please don't forget to share and leave a comment below.





Insert Into and Select Into SQL Commands Comparison

 




   In SQL Server, the "INSERT INTO" and "SELECT INTO" statements are used to insert data into a table. Both statements allow you to specify the source of the data and the destination table, but they work in slightly different ways.


   The "INSERT INTO" statement is used to insert data into an existing table. It allows you to specify the columns that you want to insert data into, as well as the values that you want to insert. For example:

INSERT INTO table_name (column1, column2, column3)
VALUES (value1, value2, value3)


   The "SELECT INTO" statement, on the other hand, is used to create a new table and insert data into it. It allows you to select data from one or more tables and insert it into a new table, which is automatically created based on the structure of the selected data. For example:

SELECT * INTO new_table_name
FROM table_name
WHERE condition


   So, the main difference between "INSERT INTO" and "SELECT INTO" is that "INSERT INTO" is used to insert data into an existing table, while "SELECT INTO" is used to create a new table and insert data into it.


   It's important to choose the appropriate statement based on your specific needs and the context of your query.


If you found this post useful, please don't forget to share and leave a comment below.










SQL Server GroupBy Best Practices




 GROUP BY is an SQL clause that is used to group rows based on a common value or set of values. It is often used in conjunction with aggregate functions, such as SUM, AVG, and COUNT, to calculate values for each group.


   To ensure that your SQL Server GROUP BY queries are efficient and effective, it's important to follow these best practices:


Use appropriate indexes:

   Indexes can help to speed up GROUP BY queries by allowing the optimizer to locate the rows more efficiently. It's a good idea to create indexes on the columns that are used in the GROUP BY clause, as well as on any columns that are used in the SELECT, WHERE, or HAVING clauses.


Avoid using unnecessary columns:

   GROUP BY queries can be slower if they return a large number of columns, as this requires more data to be retrieved and processed. To improve performance, it's a good idea to only return the columns that are actually needed for the query.


Use appropriate data types:

   Using the right data types for your columns can also impact the performance of GROUP BY queries. In general, it's best to use the smallest data type that can hold the values in your column, as this can help to reduce the amount of data that needs to be processed.


Use appropriate aggregate functions:

   The aggregate function you use can also impact the performance of GROUP BY queries. In general, it's best to use the function that is most appropriate for your needs, as some functions may be faster than others.


   By following these best practices, you can optimize the performance of your SQL Server GROUP BY queries and ensure that they run as efficiently as possible.


If you found this post useful, please don't forget to share and leave a comment below.



SQL Server Join Best Practices

 

image: Freepik.com

   Joins are an essential part of SQL and are used to combine data from two or more tables based on a common key. In SQL Server, there are several types of joins, including inner, left outer, right outer, and full outer joins.


   To get the best performance from your SQL Server joins, it's important to follow these best practices:


Use the right type of join:

   The type of join you use can have a significant impact on the performance of your query. Inner joins are generally the most efficient, as they only return rows that match the join condition. Outer joins, on the other hand, return all rows from one table, even if there is no match in the other table, which can be slower.


Use the right join order:

   The order in which you perform the joins can also affect performance. In general, it's best to start with the smallest table and work your way up, as this can reduce the number of rows that need to be processed in later joins.


Use appropriate join conditions:

   The join condition you use can also affect performance. Equijoins (joins with an "=" condition) are generally the most efficient, as they allow the optimizer to use an index to locate the matching rows. Non-equijoins (joins with other conditions) can be slower, as they require a full table scan or an index scan to locate the matching rows.


Use indexes:

Indexes can help to speed up joins by allowing the optimizer to locate matching rows more efficiently. It's a good idea to create indexes on the columns that are used in the join condition, as well as on any columns that are used in the SELECT, WHERE, or GROUP BY clauses.


By following these best practices, you can optimize the performance of your SQL Server joins and ensure that your queries run as efficiently as possible.


If you found this post useful, please don't forget to share and leave a comment below.


Data Story Telling Best Practices


Designed by vectorjuice / Freepik


 Data storytelling is the art of using data to communicate insights and information in a clear and engaging way. It involves using a combination of data visualization, narrative, and context to tell a story with data that resonates with the audience and drives action.


Here are a few best practices for data storytelling:


Start with a clear purpose:

   Before you start telling your data story, it's important to have a clear purpose in mind. What do you want to communicate with your data, and why is it important? By having a clear purpose, you can ensure that your data story is focused and relevant.


Use data visualization effectively:

   Data visualization is a powerful tool for telling stories with data. It allows you to represent complex data in a way that is easy to understand and interpret. When creating visualizations, it's important to choose the right type of chart for the data, use clear labels and axes, and avoid cluttered or confusing designs.


Tell a story with your data:

   To create a compelling data story, you need to go beyond just presenting raw data. Use your data to tell a story that has a clear beginning, middle, and end, and that follows a logical sequence of events.


Use context to provide meaning:

   Data alone is often not enough to tell a complete story. It's important to provide context and background information that helps the audience understand the significance of the data and its implications.


Engage your audience:

   A good data story should be engaging and interactive, and should involve the audience in the story. Use techniques like asking questions, using interactive elements, and using storytelling techniques to keep the audience interested and involved.


   By following these best practices, you can create data stories that are clear, compelling, and effective at communicating insights and driving action.


If you found this post useful, please don't forget to share and leave a comment below.





Power BI Data Modelling Best Practices

 





    Data modeling is the process of organizing and structuring data in a way that makes it easy to analyze and understand. In Power BI, data modeling is an important part of the report design process, and involves creating relationships, measures, and calculated columns to support analysis and visualization.

 

   To get the most out of your data model in Power BI, it's important to follow these best practices:

 

Define clear relationships:

   In Power BI, relationships define how tables are connected and how data is related to one another. It's important to define clear and accurate relationships between tables, as this will help to ensure that your data is correctly linked and that you get accurate results in your visualizations.

 


Use measures to calculate values:

   Measures are calculations that are defined in the data model and are used to perform aggregations and calculations on data. It's a good idea to use measures rather than calculated columns, as measures are more flexible and can be used in multiple places throughout your report.

 


Use calculated columns sparingly:

   Calculated columns are static calculations that are defined in the data model and are calculated at the time the data is loaded. While calculated columns can be useful in certain situations, it's generally a good idea to use them sparingly, as they can increase the size of your data model and make it more difficult to maintain.

 


Use DAX functions to improve performance:

   DAX (Data Analysis Expressions) is a powerful expression language that is used in Power BI to create measures and calculated columns. By using DAX functions effectively, you can improve the performance of your data model.

 


Use natural keys:

   Natural keys are columns that uniquely identify each row in a table and are typically used as the primary key in a table. When modeling your data in Power BI, it's best to use natural keys as the primary key, as this can help to improve the performance of your dataset and reduce the risk of errors.



Normalize your data:

   Normalization is the process of organizing your data into separate tables based on the relationships between the data. This helps to reduce redundancy and improve the efficiency of the model. In Power BI, you can use the Data Modeling view to create relationships between tables by dragging and dropping fields onto each other.

 


Use appropriate data types:

   Choosing the right data types for your fields is important for both the performance and the usability of your model. In general, it's best to use the smallest data type that can hold the values in your field. For example, if a field only contains whole numbers, use the Whole Number data type rather than Decimal Number, which stores more precision than you need.



If you found this post useful, please don't forget to share and leave a comment below.


