Ab Initio interview questions and answers
1. What is a data processing cycle and what is its significance?
Data often needs to be processed continuously while it is also being used. This repeating sequence of steps is known as the data processing cycle. It may deliver results quickly or take extra time, depending on the type, size and nature of the data. As data grows, so does the complexity of processing it, which creates a need for methods that are more reliable and advanced than existing approaches. The data processing cycle keeps that complexity under control as far as possible, without much extra effort.
2. What are the factors on which storage of data depends?
Basically, it depends on the sorting and filtering. In addition to this, it largely depends on the software one uses.
3. Do you think effective communication is necessary in the data processing? What is your strength in terms of same?
The most valuable ability in this domain is being able to rely on the data or the information. Of course, communication matters a lot in accomplishing several important tasks, such as presenting the information. An organization has many departments, and communication makes sure that things are clear and reliable for everyone.
4. Suppose we assign you a new project. What would be your initial point and the key steps that you follow?
The first thing that matters is defining the objective of the task and then engaging the team in it. This provides a solid direction for accomplishing the task, and it is especially important when working on a data set that is completely new. After this, the next thing that needs attention is effective data modeling, which includes finding missing values and validating the data. The last step is to track the results.
5. Suppose you find the term Validation mentioned with a set of data, what does that simply represent?
It indicates that the data concerned is clean and correct, and can therefore be used reliably. Data validation is widely regarded as one of the key steps in a processing system.
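As a rough illustration of what a validation step might check, here is a minimal Python sketch that flags missing values in a record. The function name, the field names and the rule "empty or whitespace-only counts as missing" are assumptions made for this example; they are not part of Ab Initio.

```python
def find_validation_errors(record, required_fields):
    """Return a list of problems found in one record (empty list means valid)."""
    errors = []
    for field in required_fields:
        value = record.get(field)
        # Treat None and blank strings as missing (an assumption for this sketch).
        if value is None or (isinstance(value, str) and value.strip() == ""):
            errors.append(f"missing value for '{field}'")
    return errors


# Hypothetical sample records: only the first one passes both checks.
records = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "   "},
    {"id": None, "name": "Cy"},
]
valid = [r for r in records if not find_validation_errors(r, ["id", "name"])]
print(len(valid))  # 1
```

Records that fail the check would normally be routed to a reject or error output rather than silently dropped.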
6. What do you mean by data sorting?
Data does not always arrive in a well-defined sequence. In fact, it is often a random collection of objects. Sorting is simply arranging the data items into the desired sets or sequence.
7. Name the technique which you can use for combining multiple data sets.
It is known as aggregation.
8. How is scientific data processing different from commercial data processing?
Scientific data processing involves a large amount of computation, i.e. arithmetic operations. A limited amount of data is provided as input and a large amount of data is produced as output. Commercial data processing is the opposite: the output is limited compared to the input data, and the computational operations are limited as well.
9. What are the benefits of data analysis?
It makes sure of the following:
10. What are the key elements of a data processing system?
These are the Converter, Aggregator, Validator, Analyzer, Summarizer, and Sorter.
11. Name any two stages of the data processing cycle and provide your answer in terms of a comparative study of them?
The first is collection and the second is preparation of data. Collection is the first stage of the data processing cycle and preparation is the second. The first stage provides the baseline for the second, so the success and simplicity of preparation depend on how accurately collection has been accomplished. Preparation is mainly the manipulation of the important data. Collection breaks data sets apart, while preparation joins them together.
12. What do you mean by the overflow errors?
Data processing often involves bulky calculations, and their results do not always fit in the memory allocated for them. For example, if a character that needs more than 8 bits is stored in a location that holds only 8 bits, an overflow error results.
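The 8-bit case can be sketched in Python as follows. Python integers do not overflow on their own, so the 8-bit field is simulated here; both helper names are hypothetical. One helper rejects values that do not fit, the other mimics the silent wrap-around that fixed-width storage can produce.

```python
def store_unsigned_8bit(value: int) -> int:
    """Store a value in a simulated 8-bit unsigned field, raising on overflow."""
    if not 0 <= value <= 255:
        raise OverflowError(f"{value} does not fit in 8 bits (0..255)")
    return value


def wrap_unsigned_8bit(value: int) -> int:
    """Mimic silent wrap-around by keeping only the low 8 bits."""
    return value & 0xFF


print(store_unsigned_8bit(200))       # 200 fits
print(wrap_unsigned_8bit(200 + 100))  # 300 wraps to 44
```

The second call shows why overflow is dangerous when it is not detected: the stored result (44) looks like valid data but is wrong.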
13. What are the factors that can compromise data integrity?
Several kinds of errors can cause this issue and lead to many other problems. These are:
1. Bugs and malware
2. Human error
3. Hardware error
4. Transfer errors, which generally include data compression beyond a limit.
14. What is data encoding?
Data needs to be kept confidential in many cases, and encoding is one way to do that. It makes sure that information remains in a form that no one other than the sender and the receiver can understand.
15. What does EDP stand for?
It stands for Electronic Data Processing.
16. Name one method which is generally considered by remote workstation when it comes to processing
17. What do you mean by a transaction file and how is it different from a sort file?
A transaction file generally holds input data for the period during which a transaction is being processed. All the master files can be updated from it. Sorting, on the other hand, is done to assign a fixed location to the data files.
18. What is the use of aggregation when we have rollup? As we know, the rollup component in Ab Initio is used to summarize groups of data records. Then where would we use aggregation?
Aggregate and Rollup can both summarize data, but Rollup is much more convenient to use. Rollup is also much more explanatory than Aggregate when it comes to understanding how a particular summarization is performed. Rollup offers additional functionality, such as input and output filtering of records. Aggregate and Rollup perform the same action, but Rollup can display intermediate results in main memory, whereas Aggregate does not support intermediate results.
19. What kinds of layouts does Ab Initio support?
Ab Initio supports serial and parallel layouts, and a graph can have both at the same time. A parallel layout depends on the degree of data parallelism. If the multifile system is 4-way parallel, a component in a graph can run 4-way parallel, provided its layout is defined to match that degree of parallelism.
20. How do you add default rules in transformer?
Double-click the transform parameter on the Parameters tab of the component properties; this opens the Transform Editor. In the Transform Editor, click the Edit menu and select Add Default Rules from the dropdown. It shows two options: 1) Match Names 2) Wildcard.
21. Do you know what a local lookup is?
If your lookup file is a multifile that is partitioned/sorted on a particular key, then the local lookup function can be used in preference to the plain lookup function call. The lookup is local to a particular partition, depending on the key. A lookup file consists of data records that can be held in main memory, which lets the transform function retrieve records much faster than retrieving them from disk. This allows the transform component to process the data records of multiple files quickly.
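The idea of a partition-local lookup can be sketched in plain Python. This is only a conceptual model, not Ab Initio code: the partition count, the field names and the CRC-based partition function are all assumptions chosen for the example.

```python
import zlib

NUM_PARTITIONS = 4  # assumed degree of parallelism for this sketch


def partition_of(key: str) -> int:
    """Deterministically map a key to a partition number."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def build_partitioned_lookup(records, key_field):
    """Split lookup records into per-partition in-memory dictionaries."""
    partitions = [dict() for _ in range(NUM_PARTITIONS)]
    for rec in records:
        partitions[partition_of(rec[key_field])][rec[key_field]] = rec
    return partitions


def lookup_local(partitions, partition_index, key):
    """Search only the given partition, as a local lookup would."""
    return partitions[partition_index].get(key)


# Hypothetical lookup data.
customers = [
    {"cust_id": "C001", "name": "Ann"},
    {"cust_id": "C002", "name": "Bob"},
]
parts = build_partitioned_lookup(customers, "cust_id")
home = partition_of("C002")
print(lookup_local(parts, home, "C002"))
```

The lookup only succeeds when the calling data is partitioned on the same key as the lookup file, because each partition sees only its own slice of the lookup data.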
22. What is the difference between a lookup file and a lookup, with a relevant example?
Generally, a lookup file represents one or more serial files (flat files). The amount of data is small enough to be held in memory, which allows transform functions to retrieve records much more quickly than they could from disk.
23. How many components are in your most complicated graph?
It depends on the type of components you use. Usually, avoid using very complicated transform functions in a graph.
24. Have you worked with packages?
Multistage transform components use packages by default. However, a user can create their own set of functions in a transform function and include it in other transform functions.
25. Can sorting and storing be done through a single piece of software, or do you need different software for each?
It depends on the type and nature of the data. Although it is possible to accomplish both tasks with the same software, many software packages have their own specialization, and using a specialized one is a good way to get quality outcomes. There are also some predefined sets of modules and operations that matter here. If the conditions imposed by them are met, users can perform multiple tasks with the same software. The output file can be provided in various formats.
26. What are the different forms of output that can be obtained after processing of data?
2. Plain Text files
3. Image files
7. Raw files
Sometimes data must be produced in more than one format, so the software accomplishing this task must have the features needed to support that.
27. Give one reason when you need to consider multiple data processing?
When the files produced are not the complete required outcome and need further processing.
28. What are the types of data processing you are familiar with?
Data is generally collected from different sources, so it may vary widely in a number of respects. It needs to pass through various analyses and other processes before it is stored. This is not as easy as it seems in most cases, which is why processing matters. A lot of time can be saved by processing the data to accomplish the various tasks that matter. Dependency on various factors for reliable operation can also be avoided to a good extent.
31. What is common between data validity and data integrity?
Both deal with errors and make sure that the operations that matter flow smoothly.
32. What do you mean by the term data warehousing? Is it different from Data Mining?
It generally involves the collection and organization of important files. The main aim is to know the exact relation between the industrial data, or the full data, and the data that is analyzed. Some experts also call it one of the best available approaches for finding errors. It entails the ability to spot problems and enables the operator to find the root causes of errors.
34. Have you used rollup component? Describe how?
You also need to declare a temporary variable if you want to get counts for a particular group. For each group, it first calls the initialize function once, then calls the rollup function for each record in the group, and finally calls the finalize function once after the last rollup call.
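The initialize / rollup / finalize calling sequence described above can be modeled in Python. This is a conceptual sketch only, not Ab Initio DML: the function signatures, the temporary-record layout and the sample field names are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def initialize(first_record):
    """Called once per group: set up the temporary variable."""
    return {"count": 0, "total": 0}


def rollup(temp, record):
    """Called once per record in the group: update the temporary variable."""
    temp["count"] += 1
    temp["total"] += record["amount"]
    return temp


def finalize(temp, key):
    """Called once after the last rollup call: build the output record."""
    return {"key": key, "count": temp["count"], "total": temp["total"]}


def rollup_by_key(records, key_field):
    """Group records by key and apply the three functions to each group."""
    output = []
    ordered = sorted(records, key=itemgetter(key_field))
    for key, group in groupby(ordered, key=itemgetter(key_field)):
        group = list(group)
        temp = initialize(group[0])
        for rec in group:
            temp = rollup(temp, rec)
        output.append(finalize(temp, key))
    return output


# Hypothetical input records.
sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 7},
    {"region": "east", "amount": 5},
]
print(rollup_by_key(sales, "region"))
```

The `count` field plays the role of the temporary variable mentioned above: it exists only while a group is being processed and is copied to the output in `finalize`.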
35. How to add default rules in transformer?
Add Default Rules opens the Add Default Rules dialog. Select one of the following:
Match Names: generates a set of rules that copies input fields to output fields with the same name.
Use Wildcard (.*) Rule: generates one rule that copies input fields to output fields with the same name.
1) If it is not already displayed, display the Transform
2) Click the Business Rules tab if it is not already displayed.
3) Select Edit > Add Default Rules.
In the case of Reformat, if the destination field names are the same as, or a subset of, the source fields, there is no need to write anything in the reformat xfr, as long as you do not need any real transform other than reducing the set of fields or splitting the flow into a number of flows.
36. What is the difference between partitioning with key and round robin?
Partition by key (hash partition): a partitioning technique used to partition data when the keys are diverse. If one key value is present in large volume, there can be large data skew. Even so, this method is used more often for parallel data processing.
Round robin partition: another partitioning technique, which distributes the data uniformly across the destination data partitions. The skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is dealt among 4 players in a round-robin manner.
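The two techniques can be compared with a small Python sketch. It is a conceptual model rather than Ab Initio code; the CRC-based hash and the function names are assumptions for the example. It reproduces the 52-card case and shows the skew that appears when every record carries the same key.

```python
import zlib


def partition_by_key(records, key_field, n):
    """Send each record to the partition chosen by hashing its key."""
    parts = [[] for _ in range(n)]
    for rec in records:
        idx = zlib.crc32(str(rec[key_field]).encode("utf-8")) % n
        parts[idx].append(rec)
    return parts


def partition_round_robin(records, n):
    """Deal records to partitions in turn, like dealing cards."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts


# 52 "cards" dealt to 4 players: each partition gets exactly 13.
cards = [{"card": i, "suit": "spades"} for i in range(52)]
print([len(p) for p in partition_round_robin(cards, 4)])  # [13, 13, 13, 13]

# Partitioning on a key with a single value puts everything in one partition.
skewed = partition_by_key(cards, "suit", 4)
print(sorted(len(p) for p in skewed))  # [0, 0, 0, 52]
```

Key partitioning still matters despite the skew risk, because it guarantees that all records with the same key land in the same partition, which key-based components such as joins and rollups rely on.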
37. How do you improve the performance of a graph?
There are many ways the performance of the graph can be improved.
- Use a limited number of components in a particular phase
- Use optimum value of max core values for sort and join components
- Minimize the number of sort components
- Minimize sorted join component and if possible replace them by in-memory join/hash join
- Use only required fields in the sort, reformat, join components
- Use phasing/flow buffers in case of merge, sorted joins
- If the two inputs are huge then use sorted join, otherwise use hash join with proper driving port
- For large dataset don’t use broadcast as partitioner
- Minimize the use of regular expression functions like re_index in the transfer functions
- Avoid repartitioning of data unnecessarily
38. How do you truncate a table?
Use the Run SQL component in Ab Initio with the DDL statement "truncate table", or use the Truncate Table component.
39. Have you ever encountered an error called “depth not equal”?
When two components are linked together and their layouts do not match, this problem can occur during compilation of the graph. A solution is to place a partitioning component between them wherever the layout changes.
40. What are primary keys and foreign keys?
In an RDBMS, the relationship between two tables is represented as a primary key and foreign key relationship. The table holding the primary key is the parent table and the table holding the foreign key is the child table. The requirement for both tables is that there should be a matching column.
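The parent/child relationship can be tried out with Python's built-in sqlite3 module. The table and column names below are hypothetical, and SQLite is used only because it ships with Python; it enforces foreign keys only after `PRAGMA foreign_keys = ON` is issued.

```python
import sqlite3


def child_insert_is_rejected() -> bool:
    """Return True if inserting a child row without a parent row fails."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
    # Parent table: cust_id is the primary key.
    conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT)")
    # Child table: cust_id is a foreign key pointing at the parent.
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, "
        "cust_id INTEGER NOT NULL REFERENCES customer(cust_id))"
    )
    conn.execute("INSERT INTO customer VALUES (1, 'Ann')")
    conn.execute("INSERT INTO orders VALUES (100, 1)")  # parent exists
    try:
        conn.execute("INSERT INTO orders VALUES (101, 99)")  # no such parent
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True
    conn.close()
    return rejected


print(child_insert_is_rejected())
```

The matching column here is `cust_id`: it appears in both tables, as the primary key in the parent and as the foreign key in the child.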
41. What is an outer join?
An outer join is used when one wants to select all the records from a port, whether or not they satisfy the join criteria.
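General outer join semantics can be sketched in Python as a left outer join over lists of dictionaries. This illustrates the concept only and is not how the Ab Initio Join component is configured; the function name, its parameters and the sample data are assumptions.

```python
def left_outer_join(left, right, key, right_fields):
    """Keep every left record; fill right-side fields with None when unmatched."""
    index = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)

    joined = []
    for left_rec in left:
        # An unmatched left record still produces one output row.
        for right_rec in index.get(left_rec[key], [None]):
            row = dict(left_rec)
            for field in right_fields:
                row[field] = right_rec[field] if right_rec is not None else None
            joined.append(row)
    return joined


# Hypothetical inputs: customer 2 has no matching order.
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "amount": 30}]
for row in left_outer_join(customers, orders, "id", ["amount"]):
    print(row)
```

An inner join would drop the record for customer 2, while the outer join keeps it with an empty `amount`.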
42. Mention what is Ab Initio?
In Ab Initio, dependency analysis is a process through which the EME examines an entire project and traces how data is transferred and transformed, from component to component and field by field, within and between graphs.
46. Explain how Abinitio EME is segregated?
The Ab Initio EME is logically divided into two segments:
- Data Integration Portion
- User Interface (access to the metadata information)
47. Mention how can you connect EME to Ab Initio Server?
There are several ways to connect to the Ab Initio server, such as:
- Log in to the EME web interface
- Connect to the EME datastore through the GDE
48. The file extensions used in Ab Initio are
- .mp: Ab Initio graph or graph component
- .mpc: Custom component or program
- .mdc: Dataset or custom data-set component
- .dml: Data manipulation language file or record type definition
- .xfr: Transform function file
- .dat: Data file (multifile or serial file)
49. Mention what information a .dbc file provides to connect to the database?
The .dbc file provides the GDE with the information it needs to connect to the database:
- Name and version number of the database to which you want to connect
- Name of the computer on which the database instance or server to which you want to connect runs, or on which the database remote access software is installed
- Name of the server, database instance or provider to which you want to link
50. Explain how you can run a graph infinitely in Ab initio?
To run a graph indefinitely, the graph's end script should call the graph's .ksh file. For example, if the graph is named abc.mp, its end script should call abc.ksh. This makes the graph run indefinitely.