| en:iot-reloaded:data_products_development [2024/12/01 13:48] – ktokarz | en:iot-reloaded:data_products_development [2024/12/10 23:26] (current) – pczekalski | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== Data Products Development ====== | ||
| + | The previous chapter discussed some essential properties of Big Data systems and explained how and why IoT systems relate to Big Data problems. In any IoT implementation, | ||
| + | |||
| + | === Business user === | ||
| + | |||
| + | Business users have good knowledge of the application domain and, in most cases, benefit significantly from the developed data product. They know how to transform data into business value for the organisation. Typically, they take positions like Production manager, | ||
| + | |||
| + | === Project sponsor === | ||
| + | |||
| + | The project sponsor defines the business problem and triggers the birth of the project. He represents the project' | ||
| + | |||
| + | === Project manager === | ||
| + | |||
| + | As in most software projects, the project manager is responsible for meeting project requirements and specifications within the given time frame and available resources. He selects the needed talent, chooses development methods and tools, and sets goals for the development team members. Usually, he reports to the project sponsor and ensures that information flows within the team. | ||
| + | |||
| + | === Business information analyst === | ||
| + | |||
| + | He possesses deep knowledge in the given business domain, supported by his skills and experience. Therefore, he is a valuable asset for the team in understanding the data's content, origin, and possible meaning. He defines the key performance indicators (KPI) and metrics to assess the project' | ||
| + | |||
| + | === Database administrator === | ||
| + | |||
| + | He is responsible for configuring the development environment and the database (one, many, or a complex distributed system). In most cases, the configuration must meet specific performance requirements, | ||
| + | |||
| + | === Data engineer === | ||
| + | |||
| + | Data engineers usually have deep technical knowledge of data manipulation methods and techniques. During the project, the data engineer tunes data manipulation procedures, SQL queries, and memory management, and develops specific stored or server-side procedures. He is responsible for extracting particular data chunks for the sandbox environment and formatting and tuning them according to data scientists' | ||
| + | |||
| + | === Data scientist === | ||
| + | |||
| + | The data scientist develops or selects the data processing models needed to meet the project specifications. He develops, tests and implements data processing methods and algorithms and, for some projects, develops decision-making support methods and their implementations. He also provides the research capacity needed for selecting and developing the data processing methods and models. | ||
| + | |||
| + | ====== | ||
| + | |||
| + | The Data Scientist clearly plays a vital role, but only in cooperation with the other roles. Depending on the competencies and capacities of the people involved, roles might overlap, and a single team member could fill several roles. | ||
| + | Once the team is built, the development process can start. As with any other product development, | ||
| + | |||
| + | <figure Dataproductlifecycle> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | === Discovery === | ||
| + | |||
| + | The project team learns about the problem domain, the problem itself, its structure, and possible data sources and defines the initial hypothesis. | ||
| + | The phase involves interviewing the stakeholders and other potentially related parties to gain as broad an insight as necessary. It is said that during this phase the problem is framed: the analytical problem, the indicators of success for potential solutions, the business goals and the scope are defined. To understand business needs, the project sponsor is involved in the process from the very beginning. The identified data sources might include external systems or APIs, sensors of different types, static data sources, official statistics and other vital sources. | ||
| + | One of the primary outcomes of the phase is the Initial Hypothesis (IH), which concisely represents the team's vision of the problem and potential solution simultaneously. For instance, " | ||
| + | Whatever the IH is, it is a much better starting point than defining the hypothesis during the project implementation in later phases. | ||
| + | |||
| + | === Data preparation === | ||
| + | |||
| + | The phase focuses on creating a sandbox system by extracting data, transforming it and loading it into the sandbox (ETL: Extract, Transform, Load). This is usually the longest phase and can take up 50% of the total time allocated to the project. Unfortunately, | ||
| + | - **Data analysis sandbox** - The client' | ||
| + | - **Carrying out ETLs** - The data is retrieved, transformed and loaded back into the sandbox system. Sometimes, simple data filtering excludes outliers and cleans the data. Due to the volume of data, there may be a need for parallelisation of data transfers, which leads to the need for appropriate software and hardware infrastructure. In addition, various web services and interfaces are used to obtain context. | ||
| + | - **Exploring the content of the data** - The main task is to get to know the content of the extracted data. A data catalogue or vocabulary is created (small projects can skip this step). Exploring the data allows the team to identify data gaps and technology flaws, and to distinguish the team's own data from external data (for determining responsibilities and limitations). | ||
| + | - **Data conditioning** - Slicing and combining are the most common actions in this step. After these manipulations, the compatibility of the data subsets with each other is checked to exclude systematic errors, that is, errors that occur as a result of incorrect manipulation (formatting of data, filling in voids, etc.). During this step, the team ensures that the time, metadata, and content match. | ||
| + | - **Reporting and visualising** - This step uses general visualisation techniques, providing a high-level overview – value distributions, | ||
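The ETL and simple outlier-filtering steps listed above can be sketched in a few lines. This is a minimal illustration only, not the chapter's own tooling: the CSV content, the `readings` table, the function names and the median-absolute-deviation threshold are all assumptions made for the example, and an in-memory SQLite database stands in for the sandbox.

```python
import csv
import io
import sqlite3
import statistics

# Hypothetical raw sensor export; a real project would read from the client's systems.
RAW_CSV = """sensor_id,timestamp,temperature
s1,2024-12-01T10:00:00,21.5
s1,2024-12-01T10:01:00,21.7
s1,2024-12-01T10:02:00,
s1,2024-12-01T10:03:00,250.0
s1,2024-12-01T10:04:00,21.6
s1,2024-12-01T10:05:00,21.4
"""


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, max_dev=5.0):
    """Transform: drop rows with missing values, then filter outliers.

    A value is treated as an outlier when it lies more than `max_dev`
    median absolute deviations from the median (an assumed, simple rule).
    """
    complete = [r for r in rows if r["temperature"].strip()]
    values = [float(r["temperature"]) for r in complete]
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return complete  # no spread, nothing to filter
    return [r for r in complete
            if abs(float(r["temperature"]) - median) <= max_dev * mad]


def load(rows, conn):
    """Load: write the cleaned rows into the sandbox database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(sensor_id TEXT, timestamp TEXT, temperature REAL)"
    )
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?)",
        [(r["sensor_id"], r["timestamp"], float(r["temperature"])) for r in rows],
    )
    conn.commit()


def run_etl(text=RAW_CSV):
    """Run the three steps against an in-memory sandbox and return the connection."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn
```

Calling `run_etl()` returns a connection whose `readings` table holds only the complete, non-outlier rows; with larger volumes the same three functions would typically be replaced by parallelised transfers, as noted in the list above.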
| + | |||
| + | === Model planning === | ||
| + | The main task of the phase is to select model candidates for data clustering, classification or other needs consistent with the Initial Hypothesis from Phase 1. | ||
| + | - **Exploring data and selecting variables** - The aim is to discover and understand variables' | ||
| + | - **Selection of methods or models** - During this step, the team creates a list of methods that match the data and the problem. A typical approach is to build many small model prototypes using ready-made tools and prototyping packages. Tools typical of the phase include, but are not limited to, R or Python, SQL and OLAP, Matlab, SPSS, and Excel (for simpler models). | ||
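As an illustration of prototyping several small candidate models, the sketch below compares a mean baseline with a straight-line fit on a held-out part of a sample. The sample values, the two candidates and the error measure are assumptions chosen for the example; they are not prescribed by the chapter.

```python
import statistics

# Hypothetical sample: (outdoor temperature, energy consumption) pairs.
SAMPLE = [(0, 50.0), (2, 46.5), (4, 42.0), (6, 38.5),
          (8, 34.0), (10, 30.5), (12, 26.0), (14, 22.5)]


def fit_mean(train):
    """Baseline prototype: always predict the mean of the training targets."""
    mean_y = statistics.fmean([y for _, y in train])
    return lambda x: mean_y


def fit_line(train):
    """Linear prototype: ordinary least squares with one input variable."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_abs_error(model, data):
    """Average absolute difference between predictions and observed values."""
    return statistics.fmean([abs(model(x) - y) for x, y in data])


def compare_prototypes(sample, prototypes):
    """Fit each prototype on every second point and score it on the rest."""
    train, holdout = sample[::2], sample[1::2]
    return {name: mean_abs_error(fit(train), holdout)
            for name, fit in prototypes.items()}


PROTOTYPES = {"mean": fit_mean, "linear": fit_line}
```

`compare_prototypes(SAMPLE, PROTOTYPES)` returns one error score per candidate, which gives the team a quick, comparable basis for shortlisting methods before full-scale development.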
| + | |||
| + | === Model development === | ||
| + | During this phase, the initially selected prototype models are implemented at full scale on the gathered data. The main question is whether the data is enough to solve the problem. There are several steps to be performed: | ||
| + | - **Data preparation** - Specific subsets of data are created, such as training, testing, and validation. The data is adjusted to the selected initial data formatting and structuring methods. | ||
| + | - **Model development** - Usually, conceptually, | ||
| + | - **Model testing** - The models are run and tuned using the selected tools and training datasets to optimise them and ensure their resilience to variations in incoming data. All decisions must be documented! This is important because all other team roles require detailed decision-making reasoning, especially during communication and operationalisation. | ||
| + | - **Key questions to be answered during the phase are:**
| + | * Is the model accurate enough? | ||
| + | * Are the results obtained meaningful in relation to the objectives set? | ||
  * Do the models avoid unacceptable mistakes?
  * Is there enough data?
| + | In some areas, false positives are more dangerous than false negatives. For example, targeting systems may inadvertently target "their own". | ||
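The point about false positives and false negatives can be made concrete by counting each error type separately and weighting them differently. The sketch below is illustrative only: the labels, the two sets of predictions and the cost weights are hypothetical, and the weights in a real project would come from the business objectives.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["tp"] += 1
        elif p == 1 and a == 0:
            counts["fp"] += 1
        elif p == 0 and a == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts):
    """Share of correct predictions, ignoring the type of error."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


def weighted_error_cost(counts, fp_cost=5.0, fn_cost=1.0):
    """Total cost when a false positive is assumed to be costlier than a false negative."""
    return counts["fp"] * fp_cost + counts["fn"] * fn_cost


# Hypothetical test labels and predictions from two candidate models.
ACTUAL = [1, 0, 1, 0, 0, 1, 0, 1]
MODEL_A = [1, 0, 1, 1, 0, 1, 1, 1]  # errs only with false positives
MODEL_B = [1, 0, 0, 0, 0, 0, 0, 1]  # errs only with false negatives
```

Both hypothetical models have the same accuracy, yet `weighted_error_cost` ranks them very differently, which is why "Is the model accurate enough?" and "Do the models avoid unacceptable mistakes?" are separate questions.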
| + | |||
| + | === Communication === | ||
| + | During this phase, the results must be compared against the established quality criteria and presented to those involved in the project. It is important not to show any drafts outside the group of data scientists: the methods used are too complex for most of those involved, which leads to incorrect conclusions and unnecessary communication to the team. Usually, the team is biased against accepting results that falsify its hypotheses and takes them too personally. However, the data led the team to the conclusions, | ||
| + | |||
| + | === Operationalisation === | ||
| + | The results presented are first integrated into the pilot project before full-scale implementation, | ||
| + | Expectations for each of the roles during this phase: | ||
| + | * **Business user:** Identifiable benefits of the model for the business. | ||
  * **Project sponsor:** Return on investment (ROI) and impact on the business as a whole – how to highlight it outside the organisation / other business.
  * **Project manager:** Completing the project within the expected deadlines with the intended resources.
  * **Business information analyst:** Add-ons to existing reports and dashboards.
| + | * **Data scientist: | ||