
Data Centres Water Requirement. From Cooling To Energy Consumption, Are They Sustainable?

A data center is a dedicated space in a building that houses computer systems and related components like storage and telecommunication systems. It comprises backup components and robust infrastructure for information exchange, power supply, security devices, and environmental control systems like fire suppression and air conditioning systems.

How Does It Work?

A data centre consists of virtual or physical servers (or robust computer systems) connected externally and internally through communication and networking equipment to store digital information and transfer it. It contains several components to serve different purposes:

Networking: It refers to the interconnections between a data center’s components and the outside world. It includes routers, app delivery controllers, firewalls, switches, etc.

Storage: An organization’s data is stored in data centres. The components for storage are tape drives, hard disk drives, solid-state drives (SSDs) with backups, etc.

Compute: It refers to the processing power and memory required to run applications. It is supplied through powerful computers to run applications.

Types of Data Centres

You can come across different types of data centres based on ownership, the technologies used, and energy efficiency. Some of the main types that organizations use are:

Managed Data Centres

In a managed data centre, a third-party service provider offers computing, data storage, and other related services to organizations directly to help them run and manage their IT operations. The service provider deploys, monitors, and manages this data centre model, offering the features via a managed platform.

You can source the managed data centre services from a colocation facility, cloud data centres, or a fixed hosting site. A managed data centre can either be partially or fully managed. If it’s partially managed, the organization will have administration control over the data centre service and implementation. However, if it’s fully managed, all the back-end data and technical details are administered and controlled by the service provider.

Suitable for: The ideal users of managed data centres are medium to large businesses.

Benefits: You do not have to deal with regular maintenance, security, and other aspects. The data centre provider is responsible for maintaining network services and components, upgrading system-level programs and operating systems, and restoring service if anything goes wrong.

Enterprise Data Centres

An enterprise data centre refers to a private facility that supports the IT operations of a single organization. It can be situated off-premises or on-premises, based on the organization’s convenience. This type of data centre may consist of multiple data centres located at different global locations to support an organization’s key functions.

For example, if a business has customers from different global regions, they can set up data centres closer to their customers to enable faster service.

Enterprise data centres can have sub-data centres, such as:

An intranet data centre controls data and applications within the main enterprise data centre. The enterprise uses this data for research & development, marketing, manufacturing, and other functions.

An extranet data centre performs business-to-business transactions inside the data centre network; the company accesses these services through VPNs or private WANs. An internet data centre supports the servers and devices needed to run web applications.

Suitable for: As the name suggests, enterprise data centres are ideal for enterprises with global reach and distinct network requirements, since they have enough revenue to support data centres at multiple locations.

Benefits: It’s beneficial for businesses as it allows them to track critical parameters like power and bandwidth utilization and helps update their applications and systems. It also helps the companies understand their needs more and scale their capacities accordingly.

However, building enterprise data centre facilities requires heavy investment, ongoing maintenance, time, and effort.

Colocation Data Centres

A colocation data centre or “colo” is a facility that a business can rent from a data centre owner to enable IT operations to support applications, servers, and devices. It is becoming increasingly popular these days, especially for organizations that don’t have enough resources to build and manage a data centre of their own but still need it anyway. In a colo, you may use features and infrastructure such as building, security, bandwidth, equipment, and cooling systems. It helps connect network devices to different network and telecommunication service providers. The popularity of colocation facilities grew around the 2000s when organizations wanted to outsource some operations but with certain controls. Even if you rent some space from a data centre provider, your employees can still work within that space and even connect with other company servers.

Suitable for: Colocation data centres are suitable for medium to large businesses.

Benefits: There are several benefits you can gain from a colocation data centre, such as:

Scalability to support your business growth; you can add or remove servers and devices easily without hassles.

You will have the option to host the data centre at different global locations closest to your customers to offer the best experience.

Colocation data centres offer high reliability with powerful servers, computing power, and redundancy.

It also saves you money, as you don’t have to build a large data centre from scratch at multiple locations; you can simply rent space based on your budget and present needs.

You don’t need to handle the data centre maintenance such as device installation, updates, power management, and other processes.

Cloud Data Centres

One of the most popular types of data centre these days is the cloud data centre. In this type, a cloud service provider runs and manages the data centre to support business applications and systems. It’s like a virtual data centre with even more benefits than colocation data centres.

Popular cloud service providers include Amazon AWS, Google, Microsoft Azure, and Salesforce. When data is uploaded to cloud servers, the cloud service provider duplicates and fragments it across multiple locations to ensure it is never lost. They also back up your data, so you don’t lose it even if something goes wrong.

Now, cloud data centres can be of two types – public and private.

Public cloud providers like AWS and Azure offer resources through the internet to the public. Private cloud service providers offer customized cloud services and give you exclusive access to a private cloud environment. Example: Salesforce CRM.

Suitable for: Cloud data centres are ideal for almost any organization of any type or scale.

Benefits: There are many benefits of using cloud data centres compared to physical or on-premise data centres, including:

It’s cost-effective, as you don’t have to invest heavily in building a data centre from scratch; you pay only for the services you use, for as long as you need them.

You are free from maintenance requirements. The provider takes care of everything, from installing systems, upgrading software, and maintaining security to backups and cooling.

It offers flexible pricing. You can opt for a monthly subscription and keep track of your expenditure more easily.

Edge Data Centres

The most recent of all, edge data centres are still in the development stage. They are smaller data centre facilities situated closer to the customers an organization serves. They use the concept of edge computing, bringing computation closer to the systems that generate data to enable faster operations. Edge data centres are characterized by connectivity and size, allowing companies to deliver services and content to their local users at greater speed and with minimal latency. They are connected to a central, large data centre or to other data centres. In the future, edge data centres could support autonomous vehicles and IoT, offering higher processing power and improving the consumer experience.

Suitable for: Small to medium-sized businesses

Benefits: The benefits of using an edge data centre are:

An edge data centre can distribute high traffic loads efficiently and cache requested content, minimizing the response time for a user request. Distributing traffic in this way also increases network reliability. Finally, it offers superb performance by placing computation closer to the source.

Hyperscale Data Centres

Hyperscale data centres are massive and house thousands of servers. They are designed to be highly scalable by adding more devices and equipment or increasing system power. The demand for hyperscale data centres is growing alongside data generation: businesses now deal with enormous, ever-increasing amounts of data, and to store and manage it they need a giant facility, for which hyperscale is the right choice.

Suitable for: Hyperscale data centres are best for large enterprises with massive amounts of data to store and manage.

Benefits: Initially, data centre providers designed hyperscale data centres for large public cloud service providers. Although enterprises can build one themselves, renting a hyperscale data centre comes with several benefits:

It offers more flexibility; companies can scale up or down based on their current needs without any difficulties.

Increased speed to market, so they can delight their customers with the best services.

Freedom from maintenance, so they don’t waste time on repetitive work and can dedicate that time to innovation.

Other than these five main types of data centres, you may come across others as well. Let’s have a quick look at them.

Carrier hotels are the main internet exchange points for the entire data traffic belonging to a specific area. Carrier hotels host more fibre and telecom providers than a typical colo. They are usually located downtown with a mature fibre infrastructure. However, creating a dense fibre system like this takes a great deal of effort and time, which is why they are rare. For example, One Wilshire in Los Angeles has 200+ carriers in the building to supply connectivity for traffic coming from the US West Coast.

Microdata centre: It’s a condensed version of the edge data centre. It can be as small as an office room and handles data processing in a specific location.

Traditional data centres: They consist of multiple servers in racks, performing different tasks. If you need more redundancy to manage your critical apps, you can add more servers to the rack. In this infrastructure, which dates back to around the 1990s, the service provider acquires, deploys, and maintains each server.

Over time, more servers are added to provide more capabilities. The operating systems must be watched with monitoring tools, which requires a certain level of expertise, and they must also be patched, updated, and verified for security. All of this requires heavy investment, not to mention that the cost of power and cooling comes on top.

Modular data centres: It’s a portable data centre, meaning you can deploy it at a place where you need data capacity. It contains modules and components offering scalability in addition to power and cooling capabilities. You can add modules, combine them with other modules or integrate them into a data centre.

Modular data centres can be of two types:

Containerized or portable: equipment is arranged in a shipping container that is transported to a particular location. It has its own cooling systems.

The other type of modular data centre arranges equipment into prefabricated components. These components are quick to assemble on site and can be added to provide more capacity.

What Are the Data Centre Tiers?

Another way of classifying data centres, based on uptime and reliability, is by data centre tier. The Uptime Institute developed this classification in the 1990s, and it defines four tiers. Let us understand them.

Tier 1: A tier one data centre has “basic capacity” and includes a UPS. It has fewer components for redundancy and backup and a single path for cooling and power. It also involves higher downtime and may lack energy efficiency systems. It offers a minimum of 99.671% uptime, which means 28.8 hours of downtime yearly.

Tier 2: A tier two data centre has “redundant capacity” and offers more components for redundancy and backup than tier 1. It also has a single path for cooling and power. Tier 2 facilities are generally private data centres, and they also lack energy efficiency. They can offer a minimum of 99.741% uptime, which means 22 hours of downtime yearly.

Tier 3: A tier three data centre is “concurrently maintainable,” ensuring any component is safe to remove without impacting the process. It has different paths for cooling and power to help maintain and update the systems.

Tier 3 data centres have redundant systems to limit operational errors and equipment failure. They utilize UPS systems that supply power continuously to servers, plus backup generators. Therefore, they offer a minimum of 99.982% uptime, which means 1.6 hours of downtime yearly, and N+1 redundancy, higher than tiers 1 and 2.

Tier 4: A tier four data centre is “fault-tolerant” and allows a production capacity to be protected from any failure type. It requires twice the number of components, equipment, and resources to maintain a continuous flow of service even during disruptions.

Organizations whose critical business operations cannot afford downtime use tier 4 data centres, which offer the highest level of redundancy, uptime, and reliability. A tier 4 data centre provides a minimum of 99.995% uptime, which means 0.4 hours of annual downtime and 2N redundancy, which is superb.
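These downtime figures follow directly from the uptime percentages. The short Python sketch below reproduces the arithmetic; the only inputs are the tier percentages quoted above.

```python
# Convert each tier's minimum uptime percentage into allowed annual downtime.
HOURS_PER_YEAR = 24 * 365  # 8760 hours

tier_uptime = {  # minimum uptime per tier, as quoted above
    "Tier 1": 99.671,
    "Tier 2": 99.741,
    "Tier 3": 99.982,
    "Tier 4": 99.995,
}

for tier, uptime_pct in tier_uptime.items():
    downtime_hours = HOURS_PER_YEAR * (1 - uptime_pct / 100)
    print(f"{tier}: {downtime_hours:.1f} hours of downtime per year")
```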

Data centre water use

Total water consumption in the USA in 2015 was 1218 billion litres per day, of which thermoelectric power used 503 billion litres, irrigation used 446 billion litres, and 147 billion litres per day went to supply 87% of the US population with potable water. Data centres consume water across two main categories: indirectly through electricity generation (traditionally thermoelectric power) and directly through cooling. In 2014, a total of 626 billion litres of water use was attributable to US data centres. This is a small proportion in the context of such high national figures; however, data centres compete with other users for access to local resources. A medium-sized data centre (15 megawatts (MW)) uses as much water as three average-sized hospitals, or more than two 18-hole golf courses. Progress has been made with using recycled and non-potable water, but from the limited figures available some data centre operators are drawing more than half of their water from potable sources. This has been the source of considerable controversy in areas of water stress and highlights the importance of understanding how data centres use water.

Water use in data centre cooling

ICT equipment generates heat and so most devices must have a mechanism to manage their temperature. Drawing cool air over hot metal transfers heat energy to that air, which is then pushed out into the environment. This works because the computer temperature is usually higher than the surrounding air. The same process occurs in data centres, just at a larger scale. ICT equipment is located within a room or hall, heat is ejected from the equipment via an exhaust and that air is then extracted, cooled and recirculated. Data centre rooms are designed to operate within temperature ranges of 20–22 °C, with a lower bound of 12 °C. As temperatures increase, equipment failure rates also increase, although not necessarily linearly.

There are several different mechanisms for data centre cooling, but the general approach involves chillers reducing air temperature by cooling water (typically to 7–10 °C), which is then used as a heat transfer mechanism. Some data centres use cooling towers, where external air travels across a wet media so that some of the water evaporates. Fans expel the hot, wet air and the cooled water is recirculated. Other data centres use adiabatic economisers, where water is sprayed directly into the air flow, or onto a heat-exchange surface, to cool the air entering the data centre. With both techniques, the evaporation results in water loss. A small 1 MW data centre using one of these types of traditional cooling can use around 25.5 million litres of water per year.
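To put that figure in perspective, 25.5 million litres over a year of continuous 1 MW operation corresponds to roughly 2.9 litres per kWh. The sketch below is a rough estimate under that assumption; the water-intensity value is an assumed input you would replace with a measured figure.

```python
# Rough annual water-use estimate for an evaporatively cooled data centre.
# Assumes the IT load runs continuously; litres_per_kwh is an assumed input
# (about 2.9 L/kWh reproduces the 25.5 million litres/year quoted above).

def annual_water_use_litres(it_load_mw: float, litres_per_kwh: float) -> float:
    """Water consumed per year for a given IT load and water intensity."""
    annual_energy_kwh = it_load_mw * 1000 * 8760  # MW -> kW, times hours in a year
    return annual_energy_kwh * litres_per_kwh

print(f"{annual_water_use_litres(1.0, 2.9):,.0f} litres/year")   # ~25 million litres
print(f"{annual_water_use_litres(15.0, 2.9):,.0f} litres/year")  # a medium-sized site
```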

Cooling the water is the main source of energy consumption. Raising the chiller water temperature from the usual 7–10 °C to 18–20 °C can reduce expenses by 40% due to the reduced temperature difference between the water and the air. Costs depend on the seasonal ambient temperature of the data centre location. In cooler regions, less cooling is required, and instead free air cooling can draw in cold air from the external environment. This also means smaller chillers can be used, reducing capital expenditure by up to 30%. Both Google and Microsoft have built data centres without chillers, but this is difficult in hot regions.

Alternative water sources

Where data centres own and operate the entire facility, there is more flexibility for exploring alternative sources of water, and different techniques for keeping ICT equipment cool.

Google’s Hamina data centre in Finland has used sea water for cooling since it opened in 2011. Using existing pipes from when the facility was a paper mill, the cold sea water is pumped into heat exchangers within the data centre. The sea water is kept separate from the freshwater, which circulates within the heat exchangers. When expelled, the hot water is mixed with cold sea water before being returned to the sea.

Despite Amazon’s poor environmental efforts in comparison to Google and Microsoft, they are expanding their use of non-potable water. Data centre operators have a history of using drinking water for cooling, and most source their water from reservoirs because access to rainfall, grey water and surface water is seen as unreliable. Digital Realty, a large global data centre operator, is one of the few companies publishing a water source breakdown. Reducing the proportion of potable water is important because the processing and filtering that drinking water requires increase the lifecycle energy footprint. The embodied energy in the manufacturing of any chemicals required for filtering must also be considered. This increases the overall carbon footprint of a data centre.

Amazon claims to be the first data centre operator approved to use recycled water for direct evaporative cooling. It has deployed this in its data centres in Northern Virginia and Oregon, and plans to retrofit facilities in Northern California. Digital Realty, by contrast, faced delays when working with a local utility in Los Angeles because a new pipeline was needed to pump recycled water to its data centres.

Microsoft’s Project Natick is a different attempt to tackle this challenge by submerging a sealed data centre under water. Tests concluded off the Orkney Islands in 2020 showed that 864 servers could run reliably for 2 years with cooling provided by the ambient sea temperature, and electricity from local renewable sources. The potential to make use of natural cooling is encouraging, however, the small scale of these systems could mean higher costs, making them appropriate only for certain high-value use cases.

ICT equipment is deployed in racks, aligned in rows, within a data centre room. Traditional cooling manages the temperature of the room as a whole; however, this is not as efficient as more targeted cooling. Moving from cooling the entire room to focused cooling of a row of servers, or even a specific rack, can achieve energy savings of up to 29%, and is the subject of a Google patent granted in 2012.

This is becoming necessary because of the increase in rack density. Microsoft is deploying new hardware such as the Nvidia DGX-2, a machine learning system that consumes 10 kW, and existing cooling techniques are proving insufficient. Using low-boiling-point liquids is more efficient than ambient air cooling, and past experiments have shown that a super-computing system can transfer 96% of excess heat to water, with 45% less heat transferred to the ambient air. Microsoft is now testing these techniques in its cloud data centres.

These projects show promise for the future, but there are still gains to be had from existing infrastructure. Google has used its AI expertise to reduce energy use from cooling by up to 40% through hourly adjustments to environmental controls based on predicted weather, internal temperatures and pressure within its existing data centres. Another idea is to co-locate data centres and desalination facilities so they can share energy-intensive operations. That most of the innovation is now led by the big three cloud providers demonstrates their scale advantage. By owning, managing and controlling the entire value chain from server design through to the location of the building, cloud vendors have been able to push data centre efficiency to levels impossible for more traditional operators to achieve.

However, only the largest providers build their own data centres, and they often work with other data centre operators in smaller regions. For example, as of the end of 2020, Google lists 21 data centres, publishes PUE for 17, but has over 100 points of presence (PoPs) around the world. These PoPs are used to provide services closer to its users, for example, to provide faster load times when streaming YouTube videos. Whilst Google owns the equipment deployed in the PoP, it does not have the same level of control as it does when it designs and builds its own data centres. Even so, Google has explored efficiency improvements such as optimising air venting, increasing the temperature from 22 to 27 °C, deploying plastic curtains to establish cool aisles for more heat-sensitive equipment, and improving the design of the air conditioning return air flow. In a case study for one of its PoPs, this work was shown to reduce PUE from 2.4 to 1.7 and saved US$67,000 per year in energy for a cost of US$25,000.
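The PUE numbers in that case study can be sanity-checked with a short calculation. The sketch below is illustrative only: the IT load and electricity price are assumptions chosen to land near the quoted saving, not figures reported by Google.

```python
# Energy cost saved when facility overhead improves from PUE 2.4 to PUE 1.7.
# PUE = total facility energy / IT equipment energy, so for a fixed IT load
# total energy scales linearly with PUE.

def annual_savings_usd(it_load_kw: float, pue_before: float,
                       pue_after: float, usd_per_kwh: float) -> float:
    it_energy_kwh = it_load_kw * 8760                     # continuous IT load
    saved_kwh = it_energy_kwh * (pue_before - pue_after)  # overhead removed
    return saved_kwh * usd_per_kwh

# Assumed inputs: a ~110 kW PoP and US$0.10/kWh give roughly US$67,000/year.
print(f"${annual_savings_usd(110, 2.4, 1.7, 0.10):,.0f} per year")
```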

References

Data Centre Types Explained in 5 Minutes or Less (geekflare.com)

Data centre water consumption | npj Clean Water (nature.com)

Drought-stricken communities push back against data centres (nbcnews.com)

Our commitment to climate-conscious data centre cooling (blog.google)

Water Usage Effectiveness For Data Centre Sustainability – AKCP

Vending Machine – Data Analysis – From study to action and how to improve performance

Data analysis of a vending machine can be very helpful, because the information obtained after transforming and visualizing the data can enhance logistics, avoid losses, and improve performance.

A vending machine is one of those machines installed in shopping malls, offices, and stores. It can sell anything stored inside: each item sits in a coil and can be bought at a fixed price.

Newer models allow the collection of useful data in CSV format, which can then be manipulated to provide a lot of information, such as customer profiles, spending, and preferences, and to discover correlations between two or more products that are sold together.

This study collects data from a single vending machine and tries to analyse it, searching for correlations between the items sold.

The data consists of a single file with 6445 rows and 16 columns. Each row corresponds to a single operation, from January to August. The most important columns for this study are:

  • Name
  • DateofSale: day, month, day number, year
  • Type of Food: Carbonated, Non-Carbonated, Food, Water
  • Type of Payment: credit card, cash
  • RCoil: coil number of the product
  • RPrice: price of the product in the coil
  • QtySold: quantity sold
  • TransTotal: total amount of the transaction. Normally one item sold means one payment, but it can happen that more than one item is sold in a single transaction

Preprocessing

The data is loaded as follows, removing unnecessary fields from the raw data:
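A minimal pandas sketch of this loading step; the file name "vending.csv" and the exact column labels are assumptions based on the field list above, so adjust them to match the real export.

```python
import pandas as pd

# Load the raw export and keep only the fields used in this study.
raw = pd.read_csv("vending.csv")  # assumed file name

df = raw[
    ["Name", "DateofSale", "Type of Food", "Type of Payment",
     "RCoil", "RPrice", "QtySold", "TransTotal"]
].rename(columns={"Type of Food": "Category", "Type of Payment": "Payment"})

# Parse the sale date and drop rows missing the essentials.
df["DateofSale"] = pd.to_datetime(df["DateofSale"], errors="coerce")
df = df.dropna(subset=["DateofSale", "Name", "QtySold"])
```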
After cleaning and transforming the data, the following table shows the entire dataset, consisting of 6445 rows and 10 columns.

The first thing to do is a preliminary calculation to see which categories are present in the dataset.

We can see that the two most important categories are food and carbonated drinks, which together correspond to 78% of total transactions over the 8 months of sampling. In the following sections we will go deeper into the data analysis.
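A minimal sketch of that preliminary calculation, reusing the dataframe from the loading step above (the "Category" column name is an assumption):

```python
# Share of transactions per product category.
category_share = (df["Category"].value_counts(normalize=True) * 100).round(1)
print(category_share)
# Food and carbonated drinks together should account for roughly 78%
# of the 6445 transactions, as noted above.
```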

Carbonated

The following table shows Carbonated products and the quantities sold, sorted from highest to lowest:

The first 5 positions, corresponding to 37% of the types of carbonated drinks, account for 1431 units sold.

Food

The following table shows Food products and the quantities sold, sorted from highest to lowest:

In the case of food, the first 5 positions cover only 23% of the total quantity sold; in addition, the number of categories/brands is 7 times larger than for carbonated drinks. This spreads out the sales, because the customer has more products to choose from. This short section has shown the data extracted from the main dataset, which is useful for indicating trending products. The information given involves no statistical inference; it is merely data that has been extracted, loaded, and transformed (ELT).

Monthly sales

If you want to see the overall study, and to discover whether there is a correlation between a carbonated drink being sold together with food, you can find it below.
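As a taste of that analysis, the monthly breakdown can be sketched in a few lines of pandas, reusing the dataframe from the loading step; the column names used here ("DateofSale", "Category", "QtySold") are assumptions rather than the exact labels of the raw export.

```python
# Quantity sold per month and per category.
monthly = (
    df.assign(month=df["DateofSale"].dt.to_period("M"))
      .groupby(["month", "Category"])["QtySold"]
      .sum()
      .unstack(fill_value=0)
)
print(monthly)  # one row per month (January to August), one column per category
```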

Population and houses growth in Switzerland

Switzerland is known for its high standard of living and picturesque landscapes, making it a popular destination for expats, students, and travelers. However, it is also known for its high cost of living, including housing prices. Renting a flat in Switzerland can be expensive, especially in larger cities such as Zurich, Geneva, and Basel.

The scope of this article is to study house prices in francs/m2 and correlate them with population. The data used is provided by opendata.swiss, and the information in this paper is free of charge.

Data Mining & Preprocessing

All data used in this study was retrieved from opendata.swiss which is the Swiss public administration’s central portal for open government data.

Several files with CSV and XLS extensions were used and adapted to provide a full dataset of information regarding population growth, building construction, and price variation through the years.

The population dataset covers 1950-2020, classified by sex, origin, and canton.

The building construction dataset, on the other hand, covers 2003 to 2020, classified by flat or building and by canton.

The last set concerns the price per m2 in Swiss francs. It covers 2012 to 2020, classified by canton and year of construction, from older than 1919 up to 2021. For our purposes, the average value per canton was used in order to homogenize the data across years and building ages.

The population data was truncated to start in 2003 to match the building construction dataset.

Analysis

For the analysis, a few statistical indicators were used:

  • Arithmetic mean, also known as the average, is a measure of central tendency that represents the typical value of a set of numbers. It is calculated by adding up all the values in a set and then dividing the sum by the number of values in the set. The arithmetic mean is commonly used in statistics to summarize data and to compare different sets of data. It is a useful measure of central tendency when the data is evenly distributed and does not have any extreme outliers. However, it can be influenced by outliers, and in such cases other measures of central tendency, such as the median or mode, may be more appropriate. Defined as: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • Standard deviation is a measure of the amount of variation or dispersion in a set of data. It is calculated as the square root of the variance, which is the average of the squared differences of each value from the mean. The standard deviation is commonly used in statistics to describe the spread of a distribution, with a higher standard deviation indicating a wider spread of values and a lower standard deviation indicating a narrower spread. It is also used in inferential statistics to calculate confidence intervals and to test hypotheses about the population from which the sample was drawn. Defined as: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
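A minimal sketch of how these two indicators can be computed per canton with pandas; the file name and column labels are assumptions, since the opendata.swiss exports use their own layouts.

```python
import pandas as pd

# Hypothetical file and column names; adapt them to the actual opendata.swiss export.
population_df = pd.read_csv("population_by_canton_2003_2020.csv")

stats_by_canton = (
    population_df.groupby("Canton")["Population"]
                 .agg(mean="mean", std="std")
                 .sort_values("std", ascending=False)
)
print(stats_by_canton.head())  # cantons with the largest year-to-year variation
```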

After calculations, graphs were constructed to visualize data and get information.

Population Data

The data covers the population from 1950 until 2020. After importing the data, it is useful to display visual information about the totals, both by sex and by citizenship. The final graph after filtering the data is as follows:

Adding a linear trend suggests that in 2030 the population will be around 9 million.
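That extrapolation is a simple linear fit. The sketch below shows the idea with NumPy; the population series is a rough placeholder, so the yearly totals from the dataset should be substituted before drawing conclusions.

```python
import numpy as np

# Placeholder series for illustration only: replace with the yearly totals
# from the filtered dataset (2003-2020).
years = np.arange(2003, 2021)
population = np.linspace(7.3e6, 8.6e6, len(years))  # roughly linear growth, illustrative

# Fit a first-degree polynomial (linear trend) and extrapolate to 2030.
slope, intercept = np.polyfit(years, population, deg=1)
projection_2030 = slope * 2030 + intercept
print(f"Projected 2030 population: {projection_2030 / 1e6:.1f} million")
```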

To get further detail on the population, it is possible to look at the population change by canton using the standard deviation, to see how the data varies through the years.

Higher values mean a larger variation in the positive (growing) direction.

Houses Data

Data about construction in Switzerland is imported next. This data covers 2003 to 2020.

It is clearly visible that the number of new constructions reached its peak in 2015 and has been declining since.

Average rent price, CHF/m2

The data is categorized by canton and year, from 2012 to 2020, and the values are averages for each of the 26 cantons. Given the declining number of new constructions, one might expect prices to grow. For this reason, this dataset is useful for studying whether there has been any variation in prices. Note that these values include both existing buildings and new constructions. The original dataset considers buildings from older than 1919 up to 2021; for practical purposes, the data was filtered.

To get a better understanding, the difference between 2012 and 2020 prices is summarized and plotted as follows:

Conclusion

The highest price deviation is in AI (Appenzell Innerrhoden), second place goes to BS (Basel-Stadt), and third to GL (Glarus). In absolute terms, Basel-Stadt shows the largest change in prices, passing from 16.90 CHF/m2 to 18.2 CHF/m2.

Zurich, which has had the highest population increase over the last 20 years, does not show a proportional increase in prices, passing from 18.5 to 19.3 CHF/m2.

A note from the last graph concerns Zug, where the price did not change over the 8 years, while in Grisons and Schwyz prices are lower than before.

It is worth recalling that prices are averages over all the houses in a canton and refer only to rent; other expenses such as common heating, waste, cleaning, parking, and other amenities are not included.

Population-and-houses-growth-in-Switzerland

Average rent in Swiss francs according to the number of rooms and the canton | opendata.swiss

Demographic evolution, 1950-2021 | opendata.swiss

Average rent per m2 in Swiss francs according to the age of construction and the canton | opendata.swiss

Data Scientist: Current State and Future Trend, a new role for the future

The field of data science has exploded in recent years, with demand for skilled professionals at an all-time high. Universities around the world now offer a variety of courses and degree programs in data science, including both undergraduate and graduate options. Online learning platforms such as Coursera, Udacity, and edX offer massive open online courses (MOOCs) in data science, allowing individuals to gain valuable knowledge and skills without enrolling in a full-time program.

While many universities and online courses provide a solid foundation in data science, heuristic knowledge gained through practical experience is equally important. Data scientists must have strong programming skills, as well as expertise in statistical analysis, machine learning, and data visualization. Effective communication skills are also essential, as data scientists must be able to explain their findings to both technical and non-technical stakeholders.

Some minimum requirements for a career in data science include a bachelor’s degree in a related field such as computer science, statistics, or mathematics, as well as experience with programming languages such as Python or R. However, many employers now require advanced degrees and significant work experience in the field.

According to the Bureau of Labor Statistics, the demand for data scientists is projected to grow by 16% between 2020 and 2030. This growth is expected to be driven by increasing demand for data-driven decision-making across industries. The field of data science is continually evolving, and professionals must keep up with the latest developments and technologies to stay competitive.

In addition to traditional data science roles, there are also emerging areas of specialization within the field, such as data engineering, data visualization, and data journalism. These specializations offer opportunities for individuals to focus on specific aspects of data science and develop expertise in a particular area.

In conclusion, data science is a rapidly growing field with strong demand for skilled professionals. While universities and online courses provide a foundation, practical experience and heuristic knowledge are equally important. Effective communication and programming skills are essential, and advanced degrees and work experience are increasingly required. With the continued growth of data-driven decision-making, the demand for data science professionals is expected to remain high.

Key Distinctions between Scientists and Engineers, to empower Data Analytics

Data analytics is a growing field, and data scientists and data engineers are crucial for its success. Both roles involve working with data but have distinct responsibilities. Data science is more like research, while data engineering is more like development. The former analyze data to extract insights and make predictions, while data engineers design and maintain the systems that enable data scientists to work with data.

Data scientists ask the right questions and find meaningful insights from data, while data engineers build and maintain the infrastructure. In short, data engineering makes data usable, while data science makes sense of it.

Both data scientists and data engineers have strong employment prospects. The demand for data scientists is projected to grow by 16% between 2020 and 2030, and for computer and information technology occupations, which include data engineers, by 11%. The increasing importance of data-driven decision making across industries means that the demand for both roles will continue to rise.

If you want to become a data engineer or data scientist, there are various educational paths to take. Many universities offer undergraduate and graduate programs in data science, computer science, or related fields. Additionally, various online courses and bootcamps offer training in data analytics, machine learning, and other relevant skills.

Data science and data engineering have vast and varied applications. In healthcare, data analytics improves patient outcomes and streamlines processes. In finance, data analytics detects fraud and predicts market trends. In retail, data analytics personalizes marketing campaigns and optimizes supply chain operations. Data science and data engineering drive innovation and create value across industries.

Conclusion

In conclusion, data scientists and data engineers are critical for data analytics success, with essential, distinct responsibilities. The demand for both roles will continue to increase, as data-driven decision making becomes more important. Pursuing a career in data analytics offers various educational paths and fields of application to explore.

Further resources

  1. “Python Data Science Handbook” by Jake VanderPlas: https://jakevdp.github.io/PythonDataScienceHandbook/
  2. “Data Science Essentials” by Microsoft: https://docs.microsoft.com/en-us/learn/paths/data-science-essentials/
  3. “Data Engineering Cookbook” by O’Reilly Media: https://www.oreilly.com/library/view/data-engineering-cookbook/9781492071424/
  4. “Data Science for Business” by Foster Provost and Tom Fawcett: https://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323
  5. “Data Engineering on Google Cloud Platform” by Google Cloud: https://cloud.google.com/solutions/data-engineering/
  6. “Applied Data Science with Python” by Coursera: https://www.coursera.org/specializations/data-science-python

What Is The Difference Between A Data Scientist And A Data Engineer?

Data scientist and data engineer are both essential roles in the field of data analytics, but they have distinct responsibilities. According to Max Shron in “Thinking with Data: How to Turn Information into Insights,” “data science is more like a research project, while data engineering is more like a development project.” This means that while data scientists focus on analyzing data to extract insights and make predictions, data engineers are responsible for designing and maintaining the systems that enable data scientists to work with the data.

Andreas Müller and Sarah Guido echo this sentiment in “Introduction to Machine Learning with Python: A Guide for Data Scientists,” stating that “data scientists are concerned with asking the right questions and finding meaningful insights from data. Data engineers are responsible for designing and maintaining the systems that enable data scientists to work with the data.” DJ Patil and Hilary Mason similarly note in “Data Driven: Creating a Data Culture” that “data engineering involves building the infrastructure to support data science, while data science involves using that infrastructure to extract insights from data.”

Joel Grus adds in “Data Science from Scratch: First Principles with Python” that “data engineering involves building the infrastructure to support data science, while data science involves using that infrastructure to extract insights from data.” Finally, Martin Kleppmann sums it up in “Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems” by saying that “data science is about making sense of data, while data engineering is about making data make sense.”

In summary, data scientists focus on extracting insights from data, while data engineers focus on building the infrastructure to store and process that data. While there may be some overlap between the roles, they have distinct responsibilities and focus on different aspects of working with data. Both roles are crucial in modern data-driven organizations, and they often work together closely to achieve common goals.

Rock Music is Alive and Powerful! Statistics from 1950 to 2020

This article was written to gather some statistics about rock music and to show what big data analysis can do to uncover hidden, useful information.

The following analysis uses data from Kaggle, under a free license.

What is Kaggle? According to online definitions, Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. On the website you can find courses, datasets, and contests/challenges, some with money prizes.

Datasets can be uploaded by individual users or by companies during a competition.

Scope of the Study

A lot of considerations can be made from the history of rock music, but the scope of this study is to document the changes that rock music went through over the years.

Rock music, as an alternative to pop music (intended as common or soft music), started as underground music and gained fame over the years, with a constant increase. Some people and critics claim that rock is dead; we will see whether there is any truth in this claim.

Data

The dataset, retrieved from Spotify in 2020, covers rock songs from 1950 to 2020, with 5484 songs and 17 tags/labels to identify and classify each song. From the tag list, only popularity is an index of audience feedback, while the remaining tags describe the characteristics of the song.

  1. Index
  2. Name: Song’s name
  3. Artist
  4. Release date
  5. Length: in minutes
  6. Popularity: A value from 0 to 100
  7. Danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
  8. Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
  9. Energy: Represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
  10. Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”.
  11. Key: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
  12. Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
  13. Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.
  14. Speechiness: This detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
  15. Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  16. Time Signature: An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
  17. Valence: Describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Popularity requires some clarification from an analytical point of view and needs some assumptions. We don’t know when popularity was measured, monthly or yearly, nor in which year. Given this lack of information, we will assume that the popularity scores were calculated in 2020, even for songs released between 1950 and 2019.

Data Pre-processing & Feature Engineering

After loading the data, we need to manipulate it according to the scope of the study; more specifically, we will count the letters in both the artist’s name and the song’s name.

The song names contain some noise created by “mastered” or “remastered” version labels, which distorts the real name of the song. Most of the time, remastering a song only cleans up the recording using new technologies and refreshes people’s memory of it.

Since there are 5848 rows in the data, this creates a lot of noise, so the best way to filter the data is to preprocess it in an aggregated way, using the statistical parameters mean, max, and min of the values for each year from 1956 to 2020. This leads to a new dataset of 65 rows, where every row is one year.
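A minimal pandas sketch of this preprocessing step; the file name and column labels ("Name", "Artist", "Release date", "Popularity") are assumptions based on the field list above, and the regular expression for stripping remaster labels is only illustrative.

```python
import pandas as pd

# Hypothetical file name; column labels follow the field list above.
songs = pd.read_csv("rock_songs_1956_2020.csv")

# Strip "- Remastered ..." noise from song titles before measuring their length.
songs["clean_name"] = songs["Name"].str.replace(r"\s*-\s*Remaster(ed)?.*$", "", regex=True)
songs["name_length"] = songs["clean_name"].str.len()
songs["artist_length"] = songs["Artist"].str.len()
songs["year"] = pd.to_datetime(songs["Release date"], errors="coerce").dt.year

# Aggregate to one row per year using mean, max and min of the numeric fields.
yearly = songs.groupby("year")[["Popularity", "name_length", "artist_length"]].agg(
    ["mean", "max", "min"]
)
print(yearly.shape)  # roughly 65 rows, one per year
```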

Below you can find the complete pdf.

historyofrock

Wind energy or Photovoltaic, which is the best?

Nowadays there are several energy sources that are not affected by quantity limitations, because they are renewed by nature in different ways: the renewable energies. The two most common and developed are wind energy (onshore and offshore) and solar (as photovoltaic). Each has advantages depending on the location where it is installed, and both are governed by environmental conditions.

Wind energy’s greatest advantage is its high energy density, which relates the energy produced in kWh to the land occupied in square metres, because turbines develop vertically. On the other hand, wind turbines can only handle wind speeds from 3 m/s (cut-in) to 25 m/s (cut-out).

Photovoltaic solar energy’s greatest advantage is that it is cheaper than wind turbines and has no cut-out parameter, but it requires more land area to achieve the same energy production.

Over the years, the cost per kW installed of both systems has fallen, to the point where photovoltaic is now cheaper than onshore wind energy, passing from roughly 5000 USD/kW to less than 1000 USD/kW in 10 years, while the decrease for wind energy has been less pronounced.

One parameter that can give the reader a clear picture of the performance difference between the two sources is the capacity factor.

It can be defined as the

unitless ratio of actual electrical energy output over a given period of time divided by the theoretical continuous maximum electrical energy output over that period.
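In code the ratio is trivial, but it is easy to mix up units, so a small sketch may help. The numbers below are illustrative assumptions, not values taken from the graph.

```python
def capacity_factor(energy_produced_mwh: float, rated_power_mw: float, hours: float) -> float:
    """Actual energy output divided by the maximum possible output over the period."""
    max_possible_mwh = rated_power_mw * hours
    return energy_produced_mwh / max_possible_mwh

# Illustrative values: a 2 MW onshore turbine producing 6100 MWh over one year.
cf = capacity_factor(energy_produced_mwh=6100, rated_power_mw=2.0, hours=8760)
print(f"Capacity factor: {cf:.0%}")  # about 35%
```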

The graph shows the capacity factor of all renewable energy sources. It can be noticed that photovoltaic has a lower capacity factor than wind, but at a cost that is 3 times lower than offshore wind and 0.5 times lower than onshore wind.

What is Data Driven Decision Making? A quick intro

Introduction

Data driven decision-making has become a buzzword in today’s business world. Companies are using data and analytics to drive their decision-making processes and gain insights into their operations. This approach allows them to optimize their processes, reduce costs, and increase revenue. In this article, we will delve into the concept of data-driven decision-making and its importance in companies. We will also explore the works of Erik Brynjolfsson, DJ Patil, and Hilary Mason, who have made significant contributions to the field.

What is Data-Driven Decision-Making?

Data-driven decision-making involves using data and analytics to drive business decisions. It is a process that involves collecting, analyzing, and interpreting data to gain insights into operations and identify patterns. By doing so, companies can make informed decisions that lead to better outcomes.

The process of data-driven decision-making involves several steps. First, data is collected from various sources, such as customer feedback, sales data, and operational data. The data is then cleaned and transformed into a format that can be analyzed. Once the data is prepared, it is analyzed using statistical methods to identify patterns and trends. Finally, the insights gained from the data analysis are used to make informed decisions.

Why is Data-Driven Decision-Making Important?

Data-driven decision-making has several benefits for companies. First, it allows them to optimize their operations and reduce costs. By analyzing data, companies can identify inefficiencies in their operations and take steps to improve them. This can lead to cost savings and increased profitability.

Second, data-driven decision-making can help companies to identify opportunities for growth and innovation. By analyzing customer data, companies can identify trends and develop new products and services that meet the needs of their customers. This can lead to increased revenue and market share.

Finally, data-driven decision-making can improve customer experience. By analyzing customer data, companies can gain insights into customer behavior and preferences. This can help them to tailor their products and services to better meet the needs of their customers, leading to increased customer satisfaction and loyalty.

Erik Brynjolfsson and Data-Driven Decision-Making

Erik Brynjolfsson is a renowned economist and Professor of Management at the Massachusetts Institute of Technology (MIT). He is a leading authority on the economics of information technology and has made significant contributions to the field of data-driven decision-making.

In a 2012 article titled “Big Data: The Management Revolution,” Brynjolfsson and his co-author Andrew McAfee argued that data-driven decision-making was transforming business operations. They highlighted the importance of data-driven decision-making in improving operational efficiency and driving innovation.

The authors noted that companies that were data-driven were more likely to be successful in the long run. They cited examples of companies like Google, Amazon, and Netflix, who had embraced data-driven decision-making and achieved great success.

Brynjolfsson and McAfee argued that data-driven decision-making was becoming more accessible to companies of all sizes. They noted that the cost of data storage and processing had decreased significantly, making it easier for companies to collect and analyze data.

The authors also cautioned that data-driven decision-making was not a silver bullet. They noted that companies needed to have the right infrastructure, talent, and culture to make data-driven decisions successfully.

DJ Patil and Data-Driven Decision-Making

DJ Patil is a data scientist and entrepreneur who has worked for companies like LinkedIn, Greylock Partners, and the US government. He is known for his contributions to the field of data science and data-driven decision-making.

Patil has emphasized the importance of data culture in companies. He argues that companies need to develop a culture that values data and encourages data-driven decision-making. This involves creating a data-driven mindset among employees and promoting data literacy across the organization.

Patil also notes that companies need to invest in data infrastructure and technology. This includes data storage, processing, and analysis tools that enable companies to collect, clean, and analyze large amounts of data.

In a 2011 report titled “Building Data Science Teams,” Patil emphasized the importance of collaboration in data-driven decision-making. He notes that data science teams need to work closely with business stakeholders to understand their needs and develop data-driven solutions that address those needs.

Patil also highlights the importance of experimentation in data-driven decision-making. He notes that companies need to be willing to experiment with new ideas and approaches, and to learn from their failures as well as their successes. This requires a culture of innovation and risk-taking, where failure is seen as an opportunity to learn and improve.

Hilary Mason and Data-Driven Decision-Making

Hilary Mason is a data scientist and entrepreneur who has worked for companies like Bitly and Fast Forward Labs. She is known for her contributions to the field of data science and her advocacy for data-driven decision-making.

Mason has emphasized the importance of data storytelling in data-driven decision-making. She argues that data needs to be presented in a way that is meaningful and engaging to stakeholders. This requires data scientists to have strong communication skills and the ability to tell compelling stories with data.

Mason also notes that companies need to focus on the right data. She argues that companies should prioritize data that is relevant to their business goals and objectives, rather than collecting data for the sake of collecting it. This requires companies to have a clear understanding of their business needs and to align their data collection efforts with those needs.

In a 2014 TED talk titled “The Urgency of Curating Data,” Mason emphasized the importance of data curation in data-driven decision-making. She notes that data needs to be curated and maintained to ensure its accuracy and reliability. This requires companies to invest in data governance and quality control processes, and to ensure that data is being used in a responsible and ethical manner.

Examples of Data-Driven Decision-Making

Data-driven decision-making has become increasingly common in companies across various industries. Here are a few examples of how companies are using data to drive their decision-making processes:

Netflix: Netflix is a prime example of a company that has embraced data-driven decision-making. The company uses data to personalize its content recommendations and to develop new content that meets the needs and preferences of its viewers. Netflix also uses data to optimize its operations and to improve customer experience.

Amazon: Amazon is another company that has leveraged data to drive its decision-making processes. The company uses data to optimize its supply chain and to improve its logistics operations. Amazon also uses data to personalize its product recommendations and to develop new products and services that meet the needs of its customers.

Ford: Ford is using data to drive its innovation efforts. The company is collecting data from its connected cars to gain insights into customer behavior and preferences. This data is being used to develop new products and services that meet the needs of Ford’s customers.

Conclusion

Data-driven decision-making has become essential in today’s business world. Companies that embrace data-driven decision-making are more likely to succeed in the long run, as they can optimize their operations, identify opportunities for growth and innovation, and improve customer experience. Erik Brynjolfsson, DJ Patil, and Hilary Mason have made significant contributions to the field of data-driven decision-making, emphasizing the importance of data culture, collaboration, storytelling, and curation. Examples of companies like Netflix, Amazon, and Ford show how data-driven decision-making is transforming business operations and driving innovation. As data becomes increasingly important in business decision-making, companies that can effectively collect, analyze, and interpret data will have a significant competitive advantage.

Natural Gas in Italy, a deep insight into the market

After several months of research, we are happy to announce our first report about LNG & Natural Gas energy in Italy. This report is the result of thorough research, data analysis, and consultations with experts in the energy sector. With over 60 pages filled with graphs, tables, and useful information, this report serves as a valuable tool for journalists, data-driven companies, and market insiders.

Natural gas is a significant source of energy in Italy, accounting for over 30% of the country’s total energy consumption. Italy is the third-largest natural gas consumer in Europe, after Germany and the United Kingdom. The country’s high dependence on natural gas has been driven by a combination of factors, including its role as a transitional fuel towards decarbonization, its flexibility in balancing intermittent renewable energy sources, and its relatively low carbon intensity compared to other fossil fuels.

The Italian natural gas market is characterized by a high level of integration with the European market, with cross-border pipelines connecting Italy to several neighboring countries, including France, Switzerland, and Austria. The country also has access to liquefied natural gas (LNG) through several import terminals located along the coast. These terminals receive LNG shipments from countries such as Qatar, Algeria, and Nigeria.

One of the key drivers of the Italian natural gas market is the power sector, which accounts for over 40% of the country’s total gas consumption. Natural gas is widely used for electricity generation, both in combined cycle gas turbines (CCGTs) and open cycle gas turbines (OCGTs). The use of natural gas in power generation is driven by its flexibility, low emissions, and relatively low cost compared to other fossil fuels.

Another important sector for natural gas in Italy is the residential and commercial sector, which accounts for around 30% of the country’s total gas consumption. Natural gas is widely used for space heating, hot water production, and cooking in households and commercial buildings. The use of natural gas in the residential and commercial sector is driven by its convenience, low emissions, and relatively low cost compared to other fuels such as oil and propane.

The industrial sector is another important consumer of natural gas in Italy, accounting for around 25% of the country’s total gas consumption. Natural gas is widely used in the industrial sector for process heat, steam production, and as a feedstock for the production of chemicals and fertilizers. The use of natural gas in the industrial sector is driven by its reliability, flexibility, and relatively low cost compared to other fuels such as coal and oil.

The Italian natural gas market is highly competitive, with several players operating in different segments of the value chain. The upstream segment is dominated by ENI, the country’s largest integrated energy company, which has a significant presence in the exploration and production of natural gas both in Italy and abroad. Other important players in the upstream segment include Edison, TotalEnergies, and Shell.

The midstream segment of the natural gas value chain in Italy is characterized by a high degree of infrastructure development, including pipelines, storage facilities, and LNG terminals. The infrastructure is operated by several players, including Snam, the country’s largest natural gas infrastructure company, and international players such as Fluxys, GRTgaz, and Trans Austria Gasleitung.

The downstream segment of the natural gas value chain in Italy is characterized by a high level of competition among gas distributors and retailers. The gas distribution network in Italy is owned and operated by several companies, including Snam, Italgas, and Hera. Retailers compete with each other to offer natural gas to residential and commercial customers, with players such as Enel Energia, Eni Gas e Luce, and Edison Energia.

Despite the significant role of natural gas in Italy’s energy mix, the country faces several challenges related to its energy transition. The transition towards a more sustainable and low-carbon energy system is a priority for Italy, which aims to achieve carbon neutrality by 2050. To achieve this goal, the country needs to reduce its dependence on fossil fuels, including natural gas, and increase the use of renewable energy sources such as solar, wind, and hydropower.

One of the main challenges for Italy’s energy transition is the need to ensure energy security and affordability while reducing greenhouse gas emissions. The country’s reliance on natural gas as a transitional fuel presents a trade-off between reducing emissions in the short term and achieving long-term decarbonization goals. To address this challenge, Italy needs to accelerate the deployment of renewable energy sources, improve energy efficiency, and develop new technologies to enable the decarbonization of the natural gas sector, such as carbon capture and storage (CCS) and hydrogen production.

Another challenge for Italy’s energy transition is the need to address the social and economic impacts of the transition, particularly in regions that are heavily dependent on fossil fuels. The closure of coal-fired power plants and the shift towards renewable energy sources and natural gas may have significant implications for local communities and workers. To address these impacts, Italy needs to develop a comprehensive strategy for a just transition that includes measures to support affected communities, provide retraining opportunities for workers, and ensure a fair and equitable distribution of the benefits of the transition.

In conclusion, natural gas is a significant source of energy in Italy, with a wide range of applications in the power, residential and commercial, and industrial sectors. The country’s high dependence on natural gas presents both opportunities and challenges for its energy transition towards a more sustainable and low-carbon energy system. Our report provides valuable insights into the Italian natural gas market, its key players, and its role in the country’s energy mix. We hope that this report will serve as a useful tool for journalists, data-driven companies, and market insiders and contribute to the ongoing discussions about Italy’s energy transition.

With more than 60 pages full of graphs and useful information, our report is a tool for journalists, data-driven companies, and market insiders.

Below you can find some excerpts of the content of the book.

If you are interested in a copy of this selected report, write to us at info@htc-sagl.ch