\section{Question Generation}

Over the years, the hybrid human-machine model has evolved into an efficient approach, with successive algorithmic improvements for selecting the next crowdsourcing questions and for performing entity resolution (ER). Our primary paper [1] introduces an end-to-end pipeline that accounts for crowd errors when solving crowdsourcing-based ER. The task of finding the best next set of questions in the presence of crowd errors has been considered in the works of paper [3] and PC-Pivot [23], but these methods could not take the entire clustering into account, because they used local metrics to select the next questions.

\textbf{Terminologies}

\textbf{Uncertain graph} -- We are given an input set of records $R = \{r_1, r_2, \ldots, r_n\}$. An entity resolution algorithm operates on $R$ and produces a clustering, that is, a set of groups of matching records $C = \{R_1, R_2, \ldots, R_m\}$ such that $R_i \cap R_j = \emptyset$ for all $i \neq j$ and $\bigcup_i R_i = R$. Each cluster $R_i$ represents a real-world entity.
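The two conditions on a clustering (pairwise disjoint clusters whose union is the full record set) can be checked mechanically. The following is a minimal sketch; the function name \texttt{is\_valid\_clustering} and the sample records are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def is_valid_clustering(records, clusters):
    """Return True if the clusters are pairwise disjoint and cover all records."""
    cluster_sets = [set(c) for c in clusters]
    # Pairwise disjointness: no record may appear in two different clusters.
    for a, b in combinations(cluster_sets, 2):
        if a & b:
            return False
    # Coverage: the union of all clusters must equal the record set.
    covered = set().union(*cluster_sets)
    return covered == set(records)


# Hypothetical example data.
records = ["r1", "r2", "r3", "r4"]
good = [["r1", "r2"], ["r3"], ["r4"]]
overlapping = [["r1", "r2"], ["r2", "r3"], ["r4"]]
incomplete = [["r1", "r2"]]
```

Here \texttt{good} satisfies both conditions, \texttt{overlapping} fails disjointness because \texttt{r2} appears twice, and \texttt{incomplete} fails coverage.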

\begin{figure}[H]
\centering
\includegraphics[width=17pc]{figures/Figure6}
\caption{Possible worlds of an uncertain graph. Three possible worlds, $G_5$, $G_6$, and $G_7$, are not clusterings, as they are not transitively closed. For example, in $G_5$, $A = C$ and $C = B$, but $A \neq B$, thus violating transitivity.}
\label{ClusteringC1C2}
\end{figure}

When an answer to a crowdsourcing question is recorded in the system, it creates an uncertain, undirected edge between the two records of the pair. Many such edges together form an uncertain undirected graph $\mathcal{G} = (R, E, p)$. Each node in the graph denotes a record $r_i \in R$, and $E \subseteq R \times R$ is the set of edges between the crowdsourced record pairs. Each edge between two records $r_i$ and $r_j$ is assigned a probability $p(r_i, r_j)$, defined as the fraction of crowd workers who voted YES on the question about the record pair $(r_i, r_j)$.

The uncertain graph $\mathcal{G}$ yields $2^{|E|}$ deterministic graphs (also called possible worlds), each written $G \sqsubseteq \mathcal{G}$.
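As an illustration, the sketch below derives edge probabilities from crowd votes, enumerates all possible worlds of a small uncertain graph, and tests each world for transitive closure. It assumes that edges exist independently, so that the probability of a world is the product of its per-edge terms; the text above does not state this, and it is adopted here only as the usual uncertain-graph convention. All function names and the vote data are hypothetical.

```python
from itertools import product


def edge_probabilities(votes):
    """Map each record pair to the fraction of workers who answered YES."""
    return {pair: sum(answers) / len(answers) for pair, answers in votes.items()}


def possible_worlds(p):
    """Yield (edge_set, probability) for each of the 2^|E| deterministic graphs.

    Assumption: edges are independent, so a world's probability is a product.
    """
    edges = list(p)
    for keep in product([False, True], repeat=len(edges)):
        prob = 1.0
        present = set()
        for edge, kept in zip(edges, keep):
            prob *= p[edge] if kept else 1.0 - p[edge]
            if kept:
                present.add(edge)
        yield frozenset(present), prob


def is_transitively_closed(edge_set):
    """True if every path u - w - v is accompanied by a direct edge u - v."""
    adj = {}
    for u, v in edge_set:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for neighbours in adj.values():
        for u in neighbours:
            for v in neighbours:
                if u != v and v not in adj[u]:
                    return False
    return True


# Hypothetical crowd answers (True = YES) for three record pairs.
votes = {
    ("A", "B"): [True, False, False, False],
    ("B", "C"): [True, True, True, False],
    ("A", "C"): [True, True, False, False],
}
p = edge_probabilities(votes)
worlds = list(possible_worlds(p))
non_clusterings = [w for w, _ in worlds if not is_transitively_closed(w)]
```

With three edges there are eight possible worlds whose probabilities sum to one. Exactly the three worlds containing two of the three edges fail the transitivity check, which is consistent with the three non-clusterings mentioned in the figure caption.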

\textbf{Reliability} -- Reliability is a global metric that uses the notion of uncertain graphs to measure connectedness both within and across clusters. It balances strongly and weakly connected components: the next crowdsourcing question can be chosen from within a weakly connected cluster, or across a pair of clusters that are only weakly separated.

\begin{figure}[H]
\centering
\includegraphics[width=17pc]{figures/Figure7}
\caption{\textbf{Reliability of Clustering}}
\label{ClusteringC1C2}
\end{figure}

Given a clustering $C = \{R_1, R_2, \ldots, R_m\}$ and the uncertain graph $\mathcal{G} = (R, E, p)$, we treat every edge inside a cluster as a YES edge and every edge across different clusters as a NO edge. For an edge $e \in E$, the existence probability is $p_Y(e) = p(e)$ if $e$ is a YES edge and $p_N(e) = 1 - p(e)$ if it is a NO edge. A YES/NO graph $\mathcal{G}_{Y|N} = (R, E, p_{Y|N}, L)$ is then created from the uncertain graph $\mathcal{G}$, as depicted in Figure 7. The YES/NO graph has the same set of edges as the uncertain graph, but each edge $e$ now carries a binary label $L(e)$, either YES or NO. The corresponding probabilities are $p_{Y|N}(e) = p_Y(e)$ for a YES edge and $p_{Y|N}(e) = p_N(e)$ for a NO edge.
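The construction of the YES/NO graph can be sketched as follows. The dictionary representation, the function name \texttt{yes\_no\_graph}, and the sample probabilities are assumptions made for illustration and are not taken from the original pipeline.

```python
def yes_no_graph(p, clusters):
    """Label every edge YES or NO and attach its existence probability.

    p:        {(u, v): probability that u and v match}
    clusters: iterable of record collections forming a clustering
    Returns   {(u, v): (label, probability)}
    """
    # Look up the index of the cluster that contains each record.
    cluster_of = {r: idx for idx, cluster in enumerate(clusters) for r in cluster}
    labeled = {}
    for (u, v), prob in p.items():
        if cluster_of[u] == cluster_of[v]:
            labeled[(u, v)] = ("YES", prob)        # inside a cluster: p(e)
        else:
            labeled[(u, v)] = ("NO", 1.0 - prob)   # across clusters: 1 - p(e)
    return labeled


# Hypothetical uncertain graph and clustering.
p = {("A", "B"): 0.75, ("B", "C"): 0.25, ("A", "C"): 0.5}
clusters = [{"A", "B"}, {"C"}]
g_yn = yes_no_graph(p, clusters)
```

In this example the edge between A and B lies inside a cluster and keeps its probability 0.75, while the edge between B and C crosses clusters and receives $1 - 0.25 = 0.75$.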

Reliability is computed from two quantities, connectivity and disconnectivity, both calculated from the YES/NO graph. Their mathematical definitions follow.

CONNECTIVITY: The connectivity of a cluster $R_i$ is the sum of the probabilities of those possible worlds of the YES/NO graph $\mathcal{G}_{Y|N}$ in which all the records of $R_i$ are connected by YES edges:

\[
\mathrm{Connect}(R_i) = \sum_{G \sqsubseteq \mathcal{G}_{Y|N}} I(R_i, G) \times P(G),
\]

where $I(R_i, G)$ is an indicator function over a possible deterministic graph $G \sqsubseteq \mathcal{G}_{Y|N}$ that takes the value 1 if the records in $R_i$ are connected by YES edges in $G$, and 0 otherwise.
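For very small graphs, the connectivity sum can be evaluated by brute force over all possible worlds. The sketch below does exactly that and is exponential in the number of edges, so it is meant only to make the definition concrete, not as a practical algorithm. It assumes independent edges when computing the probability of a world, and the function names and sample graph are hypothetical.

```python
from itertools import product


def connected_by(records, edges):
    """True if all records lie in one connected component of the given edges."""
    records = set(records)
    if len(records) <= 1:
        return True
    adj = {r: set() for r in records}
    for u, v in edges:
        if u in records and v in records:
            adj[u].add(v)
            adj[v].add(u)
    # Depth-first search from an arbitrary record of the cluster.
    start = next(iter(records))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == records


def connectivity(cluster, yes_no_edges):
    """Sum the probabilities of worlds in which YES edges connect the cluster.

    yes_no_edges: {(u, v): (label, probability)}; edges assumed independent.
    """
    edges = list(yes_no_edges)
    total = 0.0
    for keep in product([False, True], repeat=len(edges)):
        prob = 1.0
        yes_present = []
        for edge, kept in zip(edges, keep):
            label, p_e = yes_no_edges[edge]
            prob *= p_e if kept else 1.0 - p_e
            if kept and label == "YES":
                yes_present.append(edge)
        if connected_by(cluster, yes_present):  # indicator I(R_i, G)
            total += prob
    return total


# Hypothetical YES/NO graph: a triangle cluster plus one NO edge to record D.
g_yn = {
    ("A", "B"): ("YES", 0.9),
    ("B", "C"): ("YES", 0.8),
    ("A", "C"): ("YES", 0.5),
    ("C", "D"): ("NO", 0.7),
}
connect_abc = connectivity({"A", "B", "C"}, g_yn)
connect_d = connectivity({"D"}, g_yn)
```

For the triangle cluster, the records are connected whenever at least two of the three YES edges exist, which gives a connectivity of approximately 0.85. The single-record cluster is trivially connected in every world, so its connectivity is 1.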

DISCONNECTIVITY :Disconnectivity is defined across a pair of clusters Rj,Rk(j