Build secure storage for structured and unstructured data
A data lake is a centralised secure repository that allows you to store, govern, discover, and share all of your structured and unstructured data at any scale. Data lakes don't require a predefined schema, so you can process raw data without having to know what insights you might want to explore in the future. The following figure shows the key components of a data lake.
Key components of a data lake
The challenges Amazon has faced with big data are similar to the challenges many other companies face: data silos, difficulty analysing diverse datasets, data controllership, data security, and incorporating machine learning (ML).
Breaking down silos
A major reason companies choose to create data lakes is to break down data silos. Having pockets of data in different places, controlled by different groups, inherently obscures data. This often happens when a company grows fast and/or acquires new businesses. In the case of Amazon, it's been both. A data lake solves this problem by uniting all the data into one central location. Teams can continue to function as nimble units, but all roads lead back to the data lake for analytics. No more silos.
Analysing diverse datasets
Another challenge of using different systems and approaches to data management is that the data structures and information vary. For example, Amazon Prime has data for fulfilment centres and packaged goods, while AmazonFresh has data for grocery stores and food. Even shipping programs differ internationally. For example, different countries sometimes have different box sizes and shapes. There's also an increasing amount of unstructured data coming from Internet of Things (IoT) devices (like sensors on fulfilment centre machines).
Data lakes allow you to import any amount of data in any format because there is no predefined schema. You can even ingest data in real time. You can collect data from multiple sources and move it into the data lake in its original format. You can also build links between information that might be labelled differently but represents the same thing. Moving all your data to a data lake also improves what you can do with a traditional data warehouse. You have the flexibility to store highly structured, frequently accessed data in a data warehouse, while also keeping up to exabytes of structured, semi-structured, and unstructured data in your data lake storage.
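Ingesting data "in its original format" can be as simple as copying files into object storage under a consistent layout. The sketch below, assuming Amazon S3 as the data lake store, uploads a raw file unchanged under a date-partitioned key; the bucket name, key layout, and `source` label are illustrative assumptions, not part of any standard.

```python
# Sketch: landing a raw file in a data-lake bucket in its original format.
# No schema is imposed at ingestion time; the key layout is an assumption
# chosen so downstream tools can prune by date.
from datetime import date


def raw_zone_key(source: str, filename: str, day: date) -> str:
    """Build a date-partitioned S3 key for the raw zone."""
    return (f"raw/{source}/year={day.year}/"
            f"month={day.month:02d}/day={day.day:02d}/{filename}")


def ingest(local_path: str, bucket: str, source: str) -> str:
    """Upload the file unchanged and return the key it landed under."""
    import boto3  # imported lazily; running this needs AWS credentials
    key = raw_zone_key(source, local_path.rsplit("/", 1)[-1], date.today())
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key
```

For example, `ingest("events.json", "my-data-lake", "iot-sensors")` would land the file under `raw/iot-sensors/year=.../...` without transforming it.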
Managing data access
With data stored in so many locations, it's difficult both to access all of it and to link it to external tools for analysis. Amazon's operations finance data is spread across more than 25 databases, with regional teams creating their own local versions of datasets. That means over 25 access management credentials for some people. Many of the databases require access management support to do things like change profiles or reset passwords. In addition, audits and controls must be in place for each database to ensure that nobody has improper access.
With a data lake, data is stored in an open format, which makes it easier to work with different analytic services. Open format also makes it more likely for the data to be compatible with tools that don't even exist yet. Various roles in your organisation, like data scientists, data engineers, application developers, and business analysts, can access data with their choice of analytic tools and frameworks. You're not locked in to a small set of tools, and a broader group of people can make sense of the data.
Accelerating machine learning
A data lake is a powerful foundation for ML and AI (artificial intelligence), because ML and AI thrive on large, diverse datasets. ML uses statistical algorithms that learn from existing data, a process called training, to make decisions about new data, a process called inference. During training, patterns and relationships in the data are identified to build a model. The model allows you to make intelligent decisions about data it hasn't encountered before. The more data you have, the better you can train your ML models, resulting in improved accuracy.
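The training/inference split described above can be illustrated with a deliberately tiny model: a 1-nearest-neighbour classifier that memorises existing examples (training) and then labels a point it has never seen (inference). The data and labels here are toy values for illustration only.

```python
# Toy illustration of training vs. inference with a 1-nearest-neighbour model.
def train(examples):
    """Training: learn from existing data (here, simply memorise it)."""
    return list(examples)


def infer(model, point):
    """Inference: decide about new data using the closest training example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda ex: dist(ex[0], point))[1]


model = train([((0.0, 0.0), "low"), ((1.0, 1.0), "high")])
print(infer(model, (0.9, 0.8)))  # a point the model never saw; prints "high"
```

With only two training examples the model is crude; the point of the passage above is that feeding the same algorithm more (and more diverse) data from the lake is what improves accuracy.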
Using the right tools: Galaxy on AWS
Amazon's retail business uses some technology that predates the creation of Amazon Web Services (AWS), which started in 2006. To become more scalable, efficient, performant, and secure, many workloads in Amazon's retail business have moved to AWS over the last decade. The Galaxy data lake is a critical component of a larger big data platform known internally as Galaxy. The figure below shows some of the ways Galaxy relies on AWS and some of the AWS services it uses.
Although designing a complex data lake solution on AWS might seem a daunting task, we can help you apply the right tools and processes to ease the design and work out any glitches.
Going serverless on AWS is trickier than it sounds. There are many options for different use cases. Choosing the right option for each use case, and then following it up with the right strategy for developing, testing, deploying, scaling, and securing the underlying data, with proactive monitoring, requires careful planning. We can help transform your legacy applications into a bundle of coherent, interacting serverless components orchestrated together to serve your business goals.
Essentially, serverless means:
No servers to provision or manage
Scales with usage
Never pay for idle
Availability and fault tolerance built in
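The points above are easiest to see in the smallest serverless unit: an AWS Lambda function. The handler below is a minimal sketch; the event shape and the greeting logic are assumptions for illustration. There is no server to provision; AWS invokes the handler on demand and scales it with usage.

```python
# Minimal AWS Lambda handler sketch. In AWS, a trigger (API Gateway, S3,
# a schedule, etc.) invokes handler(event, context); locally you can call
# it directly to test the logic.
import json


def handler(event, context):
    # The "name" field in the event is an assumed input for this example.
    name = (event or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


print(handler({"name": "V2"}, None))
```

Because billing is per invocation and duration, this function costs nothing while idle, which is what "never pay for idle" means in practice.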
A list of Serverless offerings
We can help in choosing the right product for your business scenario(s) and couple them with appropriate data gathering services such as CloudTrail, X-ray, etc. to give you better monitoring and governance.
Set up a catalogue, ETL, and data prep with AWS Glue
Serverless provisioning, configuration, and scaling to run your ETL jobs on Apache Spark
Pay only for the resources used by jobs
Crawl your data sources, identify data formats, and suggest schemas and transformations
Automates the effort of building, maintaining, and running ETL jobs
AWS Glue in action
With AWS Glue, you create jobs using table definitions in your Data Catalog. Jobs consist of scripts that contain the programming logic that performs the transformation. You use triggers to initiate jobs either on a schedule or as a result of a specified event. You determine where your target data resides and which source data populates your target. With your input, AWS Glue generates the code that's required to transform your data from source to target. You can also provide scripts in the AWS Glue console or API to process your data.
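The job-and-trigger workflow above can be driven through the AWS Glue API. The sketch below, using boto3, schedules an existing job nightly and starts one immediate run; the job name, trigger name, and hour are illustrative assumptions, and the job itself (script, role, Data Catalog tables) is assumed to exist already.

```python
# Sketch: scheduling and starting an existing AWS Glue job with boto3.
def nightly_cron(hour: int) -> str:
    """Glue schedules use the six-field cron(...) syntax."""
    return f"cron(0 {hour} * * ? *)"


def schedule_and_run(job_name: str, hour: int = 2) -> None:
    import boto3  # imported lazily; running this needs AWS credentials
    glue = boto3.client("glue")
    # A SCHEDULED trigger fires the job at the given time each day.
    glue.create_trigger(
        Name=f"{job_name}-nightly",          # assumed naming convention
        Type="SCHEDULED",
        Schedule=nightly_cron(hour),
        Actions=[{"JobName": job_name}],
        StartOnCreation=True,
    )
    glue.start_job_run(JobName=job_name)     # one immediate run as well
```

An event-based trigger (`Type="CONDITIONAL"`) could be used instead to chain jobs, matching the "as a result of a specified event" option described above.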
Common Use Cases
Although using AWS Glue to catalogue your enterprise data might seem a daunting task, we can help you set up the relevant AWS services and tools to ease the process and work out any glitches.
Pattern Identification through Machine Learning
According to an Aberdeen study, organisations that implemented data lake strategies achieved 9% higher revenues than similar companies that did not have such a strategy, and one of the main reasons was the ability to perform new kinds of data analysis through machine learning.
Among other things, machine learning and data lakes together help companies generate not only new types of information, but also different and more complex analyses than their competition is performing, which translates into an immediate competitive advantage.
Likewise, this joint strategy allows businesses to obtain reports on more complete historical data, build more viable models to forecast probable outcomes, and receive suggestions for actions to take to achieve optimal results, broken down by business area and by process.
At V2 Technologies we help companies achieve their artificial intelligence and data analysis strategies, so they can implement data lakes and machine learning in a simpler way, integrate smarter applications, and transform their data into strategic information, positively impacting the profitability of the business and achieving better results.
Rekognition is an AWS service that provides deep learning visual analysis for your images. Rekognition is easy to integrate into your application: you provide an image or video to the Amazon Rekognition API, and the service identifies the following: objects, people, text, scenes, and activities. "Amazon Rekognition also provides highly accurate facial analysis and facial recognition. You can detect, analyse, and compare faces for a wide variety of use cases, including user verification, cataloguing, people counting, and public safety." - AWS Official Docs
Amazon Rekognition makes it easy to add image analysis to your applications using proven, highly scalable, deep learning technology that requires no machine learning expertise to use. With Amazon Rekognition, you can identify objects, people, text, scenes, and activities in images, as well as detect any inappropriate content.
Amazon Rekognition also provides highly accurate facial analysis and facial search capabilities that you can use to detect, analyse, and compare faces for a wide variety of user verification, people counting, and public safety use cases. Call us to set up and integrate Amazon Rekognition with your applications and data storages in the AWS landscape.
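Calling the Rekognition API described above is a single boto3 request. The sketch below labels an image that already sits in S3 and keeps only confident labels; the bucket, key, and confidence threshold are illustrative assumptions.

```python
# Sketch: label detection with Amazon Rekognition via boto3.
def confident_labels(response: dict, min_confidence: float = 80.0) -> list:
    """Pure helper: pull label names above a confidence threshold
    from a detect_labels response."""
    return [label["Name"]
            for label in response.get("Labels", [])
            if label["Confidence"] >= min_confidence]


def label_image(bucket: str, key: str) -> list:
    import boto3  # imported lazily; running this needs AWS credentials
    rekognition = boto3.client("rekognition")
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=10,
    )
    return confident_labels(response)
```

Something like `label_image("my-data-lake", "raw/photos/dog.jpg")` would return a list such as `["Dog", "Pet", ...]`, ready to be stored back in the lake as searchable metadata.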
Amazon QuickSight is a fast, self-serve business intelligence tool from AWS. It allows you to easily create stunning visualisations, build rich dashboards, perform ad-hoc analysis, and get business insights from your data in minutes. That data can be stored within AWS (Aurora, S3, RDS, etc.) or in an external database (SQL Server, MySQL, Teradata, Salesforce, etc.). Digging deeper into results and understanding correlations visually is a must for running a successful business. QuickSight allows you to build your own reports and dashboards intuitively.
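QuickSight can also be driven programmatically. As a small sketch, the snippet below lists the dashboards in an account with boto3, for example to link them into an internal portal; the account ID is an assumption, and the dashboards themselves are assumed to exist already.

```python
# Sketch: enumerating QuickSight dashboards with boto3.
def dashboard_names(response: dict) -> list:
    """Pure helper: extract dashboard names from a list_dashboards response."""
    return [d["Name"] for d in response.get("DashboardSummaryList", [])]


def list_dashboard_names(account_id: str) -> list:
    import boto3  # imported lazily; running this needs AWS credentials
    quicksight = boto3.client("quicksight")
    return dashboard_names(quicksight.list_dashboards(AwsAccountId=account_id))
```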